Publications

Yuen, Ivan; Möbius, Bernd; Andreeva, Bistra; Sabev, Mitko

How do word frequency and syllable surprisal affect response time and acoustic duration in sentence formulation? Inproceedings Forthcoming

Interspeech 2026, Sydney, Australia, 2026.

@inproceedings{Yuen_etal_2026:Interspeech,
title = {How do word frequency and syllable surprisal affect response time and acoustic duration in sentence formulation?},
author = {Ivan Yuen and Bernd M{\"o}bius and Bistra Andreeva and Mitko Sabev},
year = {2026},
date = {2026},
booktitle = {Interspeech 2026},
address = {Sydney, Australia},
pubstate = {forthcoming},
type = {inproceedings}
}

Project:   C1

Dyer, Andrew

What does Surprisal have to do with Information Status? Inproceedings

Vylomova, Ekaterina; Shcherbakov, Andrei; Rani, Priya (Ed.): Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual {NLP}, Association for Computational Linguistics, pp. 26-31, Rabat, Morocco, 2026, ISBN 979-8-89176-374-6.

It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to then apply this to information structure and status, considering surprising mentions to be new and unsurprising ones to be given, providing us with a ready-made continuous metric of information givenness/newness. To see if this conflation is appropriate, we perform regression experiments to see if language model surprisal is actually well predicted by information status as manually annotated, and if so, if this effect is separable from more trivial linguistic information such as parts of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal can be much better predicted by the effect of parts of speech, which are highly correlated with both information status and surprisal; and word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.

@inproceedings{dyer-2026-surprisal,
title = {What does Surprisal have to do with Information Status?},
author = {Andrew Dyer},
editor = {Ekaterina Vylomova and Andrei Shcherbakov and Priya Rani},
url = {https://aclanthology.org/2026.sigtyp-main.4/},
doi = {10.18653/v1/2026.sigtyp-main.4},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual {NLP}},
isbn = {979-8-89176-374-6},
pages = {26-31},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to then apply this to information structure and status, considering surprising mentions to be new and unsurprising ones to be given, providing us with a ready-made continuous metric of information givenness/newness. To see if this conflation is appropriate, we perform regression experiments to see if language model surprisal is actually well predicted by information status as manually annotated, and if so, if this effect is separable from more trivial linguistic information such as parts of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal can be much better predicted by the effect of parts of speech, which are highly correlated with both information status and surprisal; and word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7

Schacht, Carmen; Landwehr, Isabell

CoBra: A Compound Branching Resource for Nominal Triconstituent Compounds in English and German Inproceedings

Proceedings of the Ninth Workshop on Universal Dependencies (UDW, LREC 2026), pp. 128-141, Palma de Mallorca, Spain, 2026.

We present CoBra, a resource containing triconstituent nominal compounds in English and German. This addresses an understudied aspect of compound processing, since research and resources in psycholinguistics and NLP have mostly focused on two-constituent compounds. In addition, our resource covers both general and scientific language, allowing for a register-informed perspective on compounds. It provides syntactic and semantic annotation of compound structure, in particular of the branching direction (i.e. the internal embedding structure, the Compound Branching) and the semantic relationship between constituents. Annotations are implemented using extensions of Universal Dependencies (UD) labels. To explore applications of our new resource, we also conduct a pilot study investigating the relationship between semantic transparency and branching direction. Our results indicate that there is indeed a correlation. Overall, our resource contributes to gaining a more detailed understanding of the structure and processing of morphologically complex words within the UD framework.

@inproceedings{Schacht_etal_2026:Cobra,
title = {CoBra: A Compound Branching Resource for Nominal Triconstituent Compounds in English and German},
author = {Carmen Schacht and Isabell Landwehr},
url = {http://lrec-conf.org/proceedings/lrec2026/workshops/udw/2026.udw-1.0.pdf},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Ninth Workshop on Universal Dependencies (UDW, LREC 2026)},
pages = {128-141},
address = {Palma de Mallorca, Spain},
abstract = {We present CoBra, a resource containing triconstituent nominal compounds in English and German. This addresses an understudied aspect of compound processing, since research and resources in psycholinguistics and NLP have mostly focused on two-constituent compounds. In addition, our resource covers both general and scientific language, allowing for a register-informed perspective on compounds. It provides syntactic and semantic annotation of compound structure, in particular of the branching direction (i.e. the internal embedding structure, the Compound Branching) and the semantic relationship between constituents. Annotations are implemented using extensions of Universal Dependencies (UD) labels. To explore applications of our new resource, we also conduct a pilot study investigating the relationship between semantic transparency and branching direction. Our results indicate that there is indeed a correlation. Overall, our resource contributes to gaining a more detailed understanding of the structure and processing of morphologically complex words within the UD framework.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B1 C6

Shaik, Mohammed Maqsood; Klakow, Dietrich; Abdullah, Badr M.

Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification Inproceedings

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 11436-11440, 2024.

Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on various downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, the problem of domain mismatch remains a challenge in this area, where the domain of the pre-training data might differ from that of the downstream labeled data used for fine-tuning. In multilingual tasks such as SLID, the pre-trained speech model may not support all the languages in the downstream task. To address this challenge, we propose self-supervised adaptive pre-training (SAPT) to adapt the pre-trained model to the target domain and languages of the downstream task. We apply SAPT to the XLSR-128 model and investigate the effectiveness of this approach for the SLID task. First, we demonstrate that SAPT improves XLSR performance on the FLEURS benchmark with substantial gains up to 40.1% for under-represented languages. Second, we apply SAPT on four different datasets in a few-shot learning setting, showing that our approach improves the sample efficiency of XLSR during fine-tuning. Our experiments provide strong empirical evidence that continual adaptation via self-supervision improves downstream performance for multilingual speech models.

@inproceedings{Shaik_etal_2024:Identification,
title = {Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification},
author = {Mohammed Maqsood Shaik and Dietrich Klakow and Badr M. Abdullah},
url = {https://arxiv.org/abs/2312.07338},
doi = {10.48550/arXiv.2312.07338},
year = {2024},
date = {2024},
booktitle = {ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages = {11436-11440},
abstract = {Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on various downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, the problem of domain mismatch remains a challenge in this area, where the domain of the pre-training data might differ from that of the downstream labeled data used for fine-tuning. In multilingual tasks such as SLID, the pre-trained speech model may not support all the languages in the downstream task. To address this challenge, we propose self-supervised adaptive pre-training (SAPT) to adapt the pre-trained model to the target domain and languages of the downstream task. We apply SAPT to the XLSR-128 model and investigate the effectiveness of this approach for the SLID task. First, we demonstrate that SAPT improves XLSR performance on the FLEURS benchmark with substantial gains up to 40.1% for under-represented languages. Second, we apply SAPT on four different datasets in a few-shot learning setting, showing that our approach improves the sample efficiency of XLSR during fine-tuning. Our experiments provide strong empirical evidence that continual adaptation via self-supervision improves downstream performance for multilingual speech models.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Lee, Tyler; Stenger, Irina; Avgustinova, Tania

Linguistic and Demographic Factors in an Online Free Translation Task Inproceedings

Piperidis, Stelios; Bel, Núria; van den Heuvel, Henk; Ide, Nancy; Krek, Simon; Toral, Antonio (Ed.): Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), pp. 8587-8595, Palma, Mallorca, Spain, 2026.

Humans are remarkably adept at understanding unfamiliar languages, in part by utilizing resources from languages they do know. In this study, we investigated how various linguistic factors (word order, lexical distance) and demographic factors affected the speed and correctness of translations in a multilingual scenario. In a free translation task conducted online, participants read Polish noun phrases and translated them into English text. The noun phrases were varied between noun-adjective and adjective-noun word order, and the number of international words varied among the stimuli. Both the accuracy and total response time were recorded, and additional demographic data was recorded for all participants. Participants were more successful at translating noun phrases composed of two international terms than those with one or no such words. Additionally, speakers of other Slavic languages were more accurate despite not knowing Polish than participants who knew no Slavic languages. Although word order had little or no effect on accuracy for participants overall, speakers of Slavic languages translated the noun-adjective stimuli more accurately overall.

@inproceedings{lee-etal-2026-linguistic,
title = {Linguistic and Demographic Factors in an Online Free Translation Task},
author = {Tyler Lee and Irina Stenger and Tania Avgustinova},
editor = {Stelios Piperidis and Núria Bel and Henk van den Heuvel and Nancy Ide and Simon Krek and Antonio Toral},
url = {https://lrec.elra.info/lrec2026-main-678},
doi = {10.63317/58knfppiwdz3},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
pages = {8587-8595},
publisher = {European Language Resources Association (ELRA)},
address = {Palma, Mallorca, Spain},
abstract = {Humans are remarkably adept at understanding unfamiliar languages, in part by utilizing resources from languages they do know. In this study, we investigated how various linguistic factors (word order, lexical distance) and demographic factors affected the speed and correctness of translations in a multilingual scenario. In a free translation task conducted online, participants read Polish noun phrases and translated them into English text. The noun phrases were varied between noun-adjective and adjective-noun word order, and the number of international words varied among the stimuli. Both the accuracy and total response time were recorded, and additional demographic data was recorded for all participants. Participants were more successful at translating noun phrases composed of two international terms than those with one or no such words. Additionally, speakers of other Slavic languages were more accurate despite not knowing Polish than participants who knew no Slavic languages. Although word order had little or no effect on accuracy for participants overall, speakers of Slavic languages translated the noun-adjective stimuli more accurately overall.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Yuen, Ivan; Möbius, Bernd; Andreeva, Bistra; Sabev, Mitko

What kind of informativity could drive phonetic duration modification: Focus structure or semantic likelihood? Inproceedings

Speech Prosody 2026, pp. 540-544, 2026, ISSN 2333-2042.

Prosody is used to signal ‘informativity’, which has been separately approached in terms of information structure or information theory. As pointed out in [1], few studies combined both viewpoints in examining prosodic encoding. [1] used meaning-based contextual probability as an information-theoretical measure in an experiment and reported its influence on the fundamental frequency contour in different focus conditions in American English. A recent study of broadcast data in German also observed contributions of information status and trigram surprisal (i.e. structure-based) on syllable duration [2]. Inspired by [1], the current study revisited the role of information structure and information theory on prosodic encoding in German, by using a reading-aloud production experiment, and a meaning-based information-theoretic measure (i.e. likelihood of semantic association between two nouns) as in [1]. We hypothesized that (1) a focused component will exhibit longer duration than its non-focused counterpart, (2) a repeated word will have short duration and (3) a semantically less likely N1-N2 pairing will attenuate their durational differences and therefore attenuate the prominence relationship in each focus structure. Based on data from 10 participants, our preliminary findings provide support for (1), partial support for (2) but no support for (3).

@inproceedings{yuen26_speechprosody,
title = {What kind of informativity could drive phonetic duration modification: Focus structure or semantic likelihood?},
author = {Ivan Yuen and Bernd M{\"o}bius and Bistra Andreeva and Mitko Sabev},
url = {https://www.isca-archive.org/speechprosody_2026/yuen26_speechprosody.html},
doi = {10.21437/SpeechProsody.2026-109},
year = {2026},
date = {2026},
booktitle = {Speech Prosody 2026},
issn = {2333-2042},
pages = {540-544},
abstract = {
Prosody is used to signal ‘informativity’, which has been separately approached in terms of information structure or information theory. As pointed out in [1], few studies combined both viewpoints in examining prosodic encoding. [1] used meaning-based contextual probability as an information-theoretical measure in an experiment and reported its influence on the fundamental frequency contour in different focus conditions in American English. A recent study of broadcast data in German also observed contributions of information status and trigram surprisal (i.e. structure-based) on syllable duration [2]. Inspired by [1], the current study revisited the role of information structure and information theory on prosodic encoding in German, by using a reading-aloud production experiment, and a meaning-based information-theoretic measure (i.e. likelihood of semantic association between two nouns) as in [1]. We hypothesized that (1) a focused component will exhibit longer duration than its non-focused counterpart, (2) a repeated word will have short duration and (3) a semantically less likely N1-N2 pairing will attenuate their durational differences and therefore attenuate the prominence relationship in each focus structure. Based on data from 10 participants, our preliminary findings provide support for (1), partial support for (2) but no support for (3).
},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Kunilovskaya, Maria; Pollkläsener, Christina

EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting Inproceedings

Piperidis, Stelios; Bel, Núria; van den Heuvel, Henk; Ide, Nancy; Krek, Simon; Toral, Antonio (Ed.): Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), pp. 6998-7013, Palma, Mallorca, Spain, 2026.

This paper introduces an updated and combined version of the bidirectional English–German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particles prediction in interpreting.

@inproceedings{kunilovskaya-etal-2026-epic,
title = {EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting},
author = {Maria Kunilovskaya and Christina Pollkl{\"a}sener},
editor = {Stelios Piperidis and Núria Bel and Henk van den Heuvel and Nancy Ide and Simon Krek and Antonio Toral},
url = {https://lrec.elra.info/lrec2026-main-557},
doi = {10.63317/3txs6tgs4wsu},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
pages = {6998-7013},
publisher = {European Language Resources Association (ELRA)},
address = {Palma, Mallorca, Spain},
abstract = {
This paper introduces an updated and combined version of the bidirectional English–German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler particles prediction in interpreting.
},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Tan, David; Chen, Pinzhen; van Genabith, Josef; Dutta Chowdhury, Koel

When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation Inproceedings

Demberg, Vera; Inui, Kentaro; Marquez, Lluís (Ed.): Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, pp. 345-358, Rabat, Morocco, 2026, ISBN 979-8-89176-381-4.

Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz’s FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.

@inproceedings{tan-etal-2026-flores,
title = {When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation},
author = {David Tan and Pinzhen Chen and Josef van Genabith and Koel Dutta Chowdhury},
editor = {Vera Demberg and Kentaro Inui and Llu{\'i}s Marquez},
url = {https://aclanthology.org/2026.eacl-short.26/},
doi = {10.18653/v1/2026.eacl-short.26},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)},
isbn = {979-8-89176-381-4},
pages = {345-358},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {Large language models (LLMs) can be benchmark-contaminated, resulting in inflated scores that mask memorization as generalization, and in multilingual settings, this memorization can even transfer to "uncontaminated" languages. Using the FLORES-200 translation benchmark as a diagnostic, we study two 7-8B instruction-tuned multilingual LLMs: Bloomz, which was trained on FLORES, and Llama as an uncontaminated control. We confirm Bloomz’s FLORES contamination and demonstrate that machine translation contamination can be cross-directional, artificially boosting performance in unseen translation directions due to target-side memorization. Further analysis shows that recall of memorized references often persists despite various source-side perturbation efforts like paraphrasing and named entity replacement. However, replacing named entities leads to a consistent decrease in BLEU, suggesting an effective probing method for memorization in contaminated models.
},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Osmelak, Doreen; Dutta Chowdhury, Koel; Sentsova, Uliana; España-Bonet, Cristina; van Genabith, Josef

PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation Inproceedings

Piperidis, Stelios; Bel, Núria; van den Heuvel, Henk; Ide, Nancy; Krek, Simon; Toral, Antonio (Ed.): Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), pp. 8756-8766, Palma, Mallorca, Spain, 2026.

Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PeTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus consists of 2,900 sentence pairs from TED-Multi and Europarl, covers twelve language pairs, and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidates through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level (e.g., metric conversions) explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 for the best transfer languages. PeTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation.

@inproceedings{osmelak-etal-2026-petra,
title = {PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation},
author = {Doreen Osmelak and Koel Dutta Chowdhury and Uliana Sentsova and Cristina Espa{\~n}a-Bonet and Josef van Genabith},
editor = {Stelios Piperidis and Núria Bel and Henk van den Heuvel and Nancy Ide and Simon Krek and Antonio Toral},
url = {https://lrec.elra.info/lrec2026-main-689},
doi = {10.63317/56tberz7nmwy},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
pages = {8756-8766},
publisher = {European Language Resources Association (ELRA)},
address = {Palma, Mallorca, Spain},
abstract = {
Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PeTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus consists of 2,900 sentence pairs from TED-Multi and Europarl, covers twelve language pairs, and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidates through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level (e.g., metric conversions) explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 for the best transfer languages. PeTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation.
},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Yung, Frances Pik Yu; Ignatev, Daniil; Scholman, Merel; Demberg, Vera; Poesio, Massimo

Human label variation in implicit discourse relation recognition Inproceedings

Piperidis, Stelios; Bel, Núria; van den Heuvel, Henk; Ide, Nancy; Krek, Simon; Toral, Antonio (Ed.): Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), pp. 4942-4954, Palma, Mallorca, Spain, 2026.

There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.

@inproceedings{yung-etal-2026-human,
title = {Human label variation in implicit discourse relation recognition},
author = {Frances Pik Yu Yung and Daniil Ignatev and Merel Scholman and Vera Demberg and Massimo Poesio},
editor = {Stelios Piperidis and Núria Bel and Henk van den Heuvel and Nancy Ide and Simon Krek and Antonio Toral},
url = {https://lrec.elra.info/lrec2026-main-388},
doi = {10.63317/3nah4z4ha8r4},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
pages = {4942-4954},
publisher = {European Language Resources Association (ELRA)},
address = {Palma, Mallorca, Spain},
abstract = {There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Suresh, Varsha; Mughal, Muhammad Hamza; Theobalt, Christian; Demberg, Vera

Modeling Turn-Taking with Semantically Informed Gestures Inproceedings

Demberg, Vera; Inui, Kentaro; Marquez, Lluís (Ed.): Findings of the Association for Computational Linguistics: EACL 2026, Association for Computational Linguistics, pp. 2034-2041, Rabat, Morocco, 2026, ISBN 979-8-89176-386-9.

In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.

@inproceedings{suresh-etal-2026-modeling,
title = {Modeling Turn-Taking with Semantically Informed Gestures},
author = {Varsha Suresh and Muhammad Hamza Mughal and Christian Theobalt and Vera Demberg},
editor = {Vera Demberg and Kentaro Inui and Llu{\'i}s Marquez},
url = {https://aclanthology.org/2026.findings-eacl.106/},
doi = {https://doi.org/10.18653/v1/2026.findings-eacl.106},
year = {2026},
date = {2026},
booktitle = {Findings of the Association for Computational Linguistics: EACL 2026},
isbn = {979-8-89176-386-9},
pages = {2034-2041},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Bagdasarov, Sergei; Alves, Diego; Fischer, Stefan; Teich, Elke

Using LLMs for Automatic Discipline Annotation in a Diachronic Corpus of English Scientific Papers Inproceedings

Piperidis, Stelios; Bel, Núria; van den Heuvel, Henk; Ide, Nancy; Krek, Simon; Toral, Antonio (Ed.): Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), pp. 2376-2386, Palma, Mallorca, Spain, 2026.

This study investigates the potential of generative large language models (LLMs) to automatically identify the disciplines of scientific papers in the Royal Society Corpus (RSC) – an extensive collection of English scientific publications spanning more than three centuries. We evaluated eight open-source, state-of-the-art LLMs from four model families on a manually annotated subset and further validated the three best-performing models on a corpus of modern scientific texts. These models were subsequently used for large-scale annotation of the RSC. The models exhibited robust and consistent performance, with at least two LLMs agreeing on the same label for 98.3% of the documents. We then conducted an error analysis of papers assigned divergent labels and a diachronic case study of disciplinary trends within the corpus. The error analysis revealed that most discrepancies occurred in twentieth-century texts, reflecting the growing interdisciplinarity of research. The diachronic analysis showed a gradual decline in disciplinary diversity over time as well as fluctuations corresponding to major paradigm shifts such as the Chemical Revolution and key twentieth-century developments in Physics. The discipline labels generated by the three models will be made publicly available.

@inproceedings{bagdasarov-etal-2026-llms,
title = {Using LLMs for Automatic Discipline Annotation in a Diachronic Corpus of English Scientific Papers},
author = {Sergei Bagdasarov and Diego Alves and Stefan Fischer and Elke Teich},
editor = {Stelios Piperidis and Núria Bel and Henk van den Heuvel and Nancy Ide and Simon Krek and Antonio Toral},
url = {https://lrec.elra.info/lrec2026-main-187},
doi = {https://doi.org/10.63317/3j9wvu86v48t},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
pages = {2376--2386},
publisher = {European Language Resources Association (ELRA)},
address = {Palma, Mallorca, Spain},
abstract = {This study investigates the potential of generative large language models (LLMs) to automatically identify the disciplines of scientific papers in the Royal Society Corpus (RSC) – an extensive collection of English scientific publications spanning more than three centuries. We evaluated eight open-source, state-of-the-art LLMs from four model families on a manually annotated subset and further validated the three best-performing models on a corpus of modern scientific texts. These models were subsequently used for large-scale annotation of the RSC. The models exhibited robust and consistent performance, with at least two LLMs agreeing on the same label for 98.3% of the documents. We then conducted an error analysis of papers assigned divergent labels and a diachronic case study of disciplinary trends within the corpus. The error analysis revealed that most discrepancies occurred in twentieth-century texts, reflecting the growing interdisciplinarity of research. The diachronic analysis showed a gradual decline in disciplinary diversity over time as well as fluctuations corresponding to major paradigm shifts such as the Chemical Revolution and key twentieth-century developments in Physics. The discipline labels generated by the three models will be made publicly available.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Alves, Diego; Bagdasarov, Sergei; Teich, Elke

Cognitive Signatures of Multi-Word Expressions: Reading-Time and Surprisal Inproceedings

Kr. Ojha, Atul; Barbu Mititelu, Verginica; Constant, Mathieu; Stoyanova, Ivelina; Seza Doğruöz, A.; Rademaker, Alexandre (Ed.): Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026), Association for Computational Linguistics, pp. 48-53, Rabat, Morocco, 2026, ISBN 979-8-89176-363-0.

This study investigates whether eye-tracking measures predict if a word is the final token of a multi-word expression (MWE), focusing on two understudied MWE types: fixed expressions (e.g., due to) and phrasal verbs (e.g., turn out). Using mixed-effects logistic regression, we compared tokens in MWE contexts with the same tokens in non-MWE contexts. Results reveal a clear difference in processing. For fixed expressions, reading-time measures significantly predict MWEhood. In contrast, phrasal verbs show no consistent predictive effects. Additionally, we compared the reading-time models to models that included GPT-2 surprisal as a predictor. While surprisal does predict MWEhood, it fails to capture the distinction between types. These findings highlight the need to consider MWE typology in models of formulaic language processing.

@inproceedings{alves-etal-2026-cognitive,
title = {Cognitive Signatures of Multi-Word Expressions: Reading-Time and Surprisal},
author = {Diego Alves and Sergei Bagdasarov and Elke Teich},
editor = {Atul Kr. Ojha and Verginica Barbu Mititelu and Mathieu Constant and Ivelina Stoyanova and A. Seza Doğru{\"o}z and Alexandre Rademaker},
url = {https://aclanthology.org/2026.mwe-1.5/},
doi = {https://doi.org/10.18653/v1/2026.mwe-1.5},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)},
isbn = {979-8-89176-363-0},
pages = {48-53},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {This study investigates whether eye-tracking measures predict if a word is the final token of a multi-word expression (MWE), focusing on two understudied MWE types: fixed expressions (e.g., due to) and phrasal verbs (e.g., turn out). Using mixed-effects logistic regression, we compared tokens in MWE contexts with the same tokens in non-MWE contexts. Results reveal a clear difference in processing. For fixed expressions, reading-time measures significantly predict MWEhood. In contrast, phrasal verbs show no consistent predictive effects. Additionally, we compared the reading-time models to models that included GPT-2 surprisal as a predictor. While surprisal does predict MWEhood, it fails to capture the distinction between types. These findings highlight the need to consider MWE typology in models of formulaic language processing.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Steuer, Julius; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke; Klakow, Dietrich

Modeling the Memory-Surprisal Trade-Off over Time: Communicative Efficiency Decreases with Lexico-Grammatical Change in Scientific English Inproceedings

Piperidis, Stelios; Bel, Núria; van den Heuvel, Henk; Ide, Nancy; Krek, Simon; Toral, Antonio (Ed.): Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), pp. 11309-11319, Palma, Mallorca, Spain, 2026.

The memory-surprisal trade-off (MST) has been shown to hold cross-linguistically as a general principle of communicative efficiency: languages that exhibit information locality tend to have word orders that allow for efficient memory use, i.e., lower surprisal at a fixed memory budget. In this paper, we explore the influence of diachronic variation on the MST. We compare scientific English in the Royal Society Corpus (RSC, 18thc. – 20thc.) to "general language" in the Corpus of Historical American English (COHA) to assess the impact of intra-linguistic variation (register). We find that both time and register influence the shape of the tradeoff: Over time, vocabulary expansion raises minimal surprisal, while the shape of the MST curves changes. Decreasing distances between syntactic dependencies due to more local nominal encodings change how predictive information is distributed across memory scales. The effects are stronger for the RSC than for COHA.

@inproceedings{steuer-etal-2026-modeling,
title = {Modeling the Memory-Surprisal Trade-Off over Time: Communicative Efficiency Decreases with Lexico-Grammatical Change in Scientific English},
author = {Julius Steuer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich and Dietrich Klakow},
editor = {Stelios Piperidis and Núria Bel and Henk van den Heuvel and Nancy Ide and Simon Krek and Antonio Toral},
url = {https://lrec.elra.info/lrec2026-main-884},
doi = {https://doi.org/10.63317/4txotk7fwkhp},
year = {2026},
date = {2026},
booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)},
pages = {11309-11319},
publisher = {European Language Resources Association (ELRA)},
address = {Palma, Mallorca, Spain},
abstract = {The memory-surprisal trade-off (MST) has been shown to hold cross-linguistically as a general principle of communicative efficiency: languages that exhibit information locality tend to have word orders that allow for efficient memory use, i.e., lower surprisal at a fixed memory budget. In this paper, we explore the influence of diachronic variation on the MST. We compare scientific English in the Royal Society Corpus (RSC, 18thc. – 20thc.) to "general language" in the Corpus of Historical American English (COHA) to assess the impact of intra-linguistic variation (register). We find that both time and register influence the shape of the tradeoff: Over time, vocabulary expansion raises minimal surprisal, while the shape of the MST curves changes. Decreasing distances between syntactic dependencies due to more local nominal encodings change how predictive information is distributed across memory scales. The effects are stronger for the RSC than for COHA.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B1 B4

Skrjanec, Iza; Demberg, Vera

Language models that match reader experience are better predictors of reading times Journal Article

Journal of Memory and Language, 146, pp. 104677, 2026, ISSN 0749-596X.

Humans differ in the language experience that they accumulate, due to differing interests, reading habits and profession. This experience can be expected to affect their linguistic expectations when reading texts from domains that are very familiar to them. The present article explores whether language models trained to match the experience of readers produce surprisal estimates that more accurately predict the reading times of those readers than the usually employed general language models. We use a German eye-tracking corpus of biology and physics students reading expository texts from these domains. We adapt a neural language model to the experience of these two groups of readers via two domain adaptation methods and varying amounts of training data. The evaluation against one early and two late reading measures suggests that aligning language models with the readers’ experience to predict the processing effort results in a better fit on late measures than using a model with a high linguistic accuracy. Our findings highlight the opportunities for exploring the cognitive plausibility of language models with respect to psychological constructs.

@article{SKRJANEC2026104677,
title = {Language models that match reader experience are better predictors of reading times},
author = {Iza Skrjanec and Vera Demberg},
url = {https://www.sciencedirect.com/science/article/pii/S0749596X25000701},
doi = {https://doi.org/10.1016/j.jml.2025.104677},
year = {2026},
date = {2026},
journal = {Journal of Memory and Language},
pages = {104677},
volume = {146},
abstract = {Humans differ in the language experience that they accumulate, due to differing interests, reading habits and profession. This experience can be expected to affect their linguistic expectations when reading texts from domains that are very familiar to them. The present article explores whether language models trained to match the experience of readers produce surprisal estimates that more accurately predict the reading times of those readers than the usually employed general language models. We use a German eye-tracking corpus of biology and physics students reading expository texts from these domains. We adapt a neural language model to the experience of these two groups of readers via two domain adaptation methods and varying amounts of training data. The evaluation against one early and two late reading measures suggests that aligning language models with the readers’ experience to predict the processing effort results in a better fit on late measures than using a model with a high linguistic accuracy. Our findings highlight the opportunities for exploring the cognitive plausibility of language models with respect to psychological constructs.},
pubstate = {published},
type = {article}
}

Project:   A8

Reich, Ingo; Lemke, Tyll Robin; Schäfer, Lisa

Questions under discussion, salience and the acceptability of fragments Book Chapter

Konietzko, Andreas; Winkler, Susanne (Ed.): Information Structure and Discourse in Generative Grammar: Mechanisms and Processes, De Gruyter Mouton, pp. 157–190, Berlin; Boston, 2026.

This paper tests the predictions of the QUD approach (Roberts 1996) with respect to the processing of discourse-initial fragments. To this effect, we present three empirical studies, two production studies and an acceptability rating study. In a pretest, we first estimate the salience of potential QUDs in a given utterance context c (= context-based likelihood). In the first experiment, we then estimate the salience of QUDs given both the utterance context c and the utterance a itself (= answer-based likelihood). In the second experiment subjects rate discourse-initial fragments and sentences a that relate to a QUD with an either high (= predictable) or low (= unpredictable) context-based likelihood. Our results show that the answer-based likelihoods do not predict the ratings, which is surprising given the assumptions of the QUD approach. At the same time the data suggests (i) that subjects generally prefer utterances whose QUD is already salient in the utterance context, and (ii) that retrieving the QUD of fragments with a rather low context-based likelihood requires higher processing effort, which is in turn reflected in degraded ratings.

@inbook{lemke.etalquestions,
title = {Questions under discussion, salience and the acceptability of fragments},
author = {Ingo Reich and Tyll Robin Lemke and Lisa Sch{\"a}fer},
editor = {Andreas Konietzko and Susanne Winkler},
url = {https://www.degruyterbrill.com/de/document/doi/10.1515/9781501514425-006/html},
doi = {https://doi.org/10.1515/9781501514425-006},
year = {2026},
date = {2026},
booktitle = {Information Structure and Discourse in Generative Grammar: Mechanisms and Processes},
pages = {157–190},
publisher = {De Gruyter Mouton},
address = {Berlin; Boston},
abstract = {This paper tests the predictions of the QUD approach (Roberts 1996) with respect to the processing of discourse-initial fragments. To this effect, we present three empirical studies, two production studies and an acceptability rating study. In a pretest, we first estimate the salience of potential QUDs in a given utterance context c (= context-based likelihood). In the first experiment, we then estimate the salience of QUDs given both the utterance context c and the utterance a itself (= answer-based likelihood). In the second experiment subjects rate discourse-initial fragments and sentences a that relate to a QUD with an either high (= predictable) or low (= unpredictable) context-based likelihood. Our results show that the answer-based likelihoods do not predict the ratings, which is surprising given the assumptions of the QUD approach. At the same time the data suggests (i) that subjects generally prefer utterances whose QUD is already salient in the utterance context, and (ii) that retrieving the QUD of fragments with a rather low context-based likelihood requires higher processing effort, which is in turn reflected in degraded ratings.},
pubstate = {published},
type = {inbook}
}

Project:   B3

Belcher, Kate Rebecca; Crocker, Matthew W.

Correlating Language Model Surprisal With Cloze and Plausibility: Getting the Best of Both Measures Inproceedings

Proceedings of the 15th Workshop on Cognitive Modeling and Computational Linguistics (CMCL), pp. 99-109, 2026.

Prediction is central to both expectation-based theories of human language processing (such as Surprisal Theory), and the objective of neural network-based causal language models, where upcoming tokens are predicted based on their preceding context. With this similarity in mind, we investigated how language model predictions align with human linguistic prediction measures. We investigated the extent to which small-sized causal LLMs capture two common proxy measures of human surprisal – cloze probability and plausibility – in their predictive patterns. For this analysis, we created a new dataset of 660 sentence pair items with a minimal triplet design, in which target words vary across the full scale of word predictability, and calculate metric alignment by way of Pearson correlation. We find a stronger overall correlation of LM-surprisal with plausibility than with cloze, and, notably, the relationships between LM-surprisal and each of the two offline measures is found to vary depending on the relative predictability of the target word. We conclude that LM-surprisal offers a distinct perspective as a predictability measure than both offline behavioural measures, and that it may offer a useful tool in teasing apart nuances in predictability in certain instances which are not always captured by cloze probability and plausibility alone.

@inproceedings{belcher2026cmcl,
title = {Correlating Language Model Surprisal With Cloze and Plausibility: Getting the Best of Both Measures},
author = {Kate Rebecca Belcher and Matthew W. Crocker},
url = {http://lrec-conf.org/proceedings/lrec2026/workshops/cmcl/2026.cmcl-1.0.pdf},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 15th Workshop on Cognitive Modeling and Computational Linguistics (CMCL)},
pages = {99-109},
abstract = {Prediction is central to both expectation-based theories of human language processing (such as Surprisal Theory), and the objective of neural network-based causal language models, where upcoming tokens are predicted based on their preceding context. With this similarity in mind, we investigated how language model predictions align with human linguistic prediction measures. We investigated the extent to which small-sized causal LLMs capture two common proxy measures of human surprisal – cloze probability and plausibility – in their predictive patterns. For this analysis, we created a new dataset of 660 sentence pair items with a minimal triplet design, in which target words vary across the full scale of word predictability, and calculate metric alignment by way of Pearson correlation. We find a stronger overall correlation of LM-surprisal with plausibility than with cloze, and, notably, the relationships between LM-surprisal and each of the two offline measures is found to vary depending on the relative predictability of the target word. We conclude that LM-surprisal offers a distinct perspective as a predictability measure than both offline behavioural measures, and that it may offer a useful tool in teasing apart nuances in predictability in certain instances which are not always captured by cloze probability and plausibility alone.},
pubstate = {published},
type = {inproceedings}
}

Project:   A1

Teich, Elke; Przybyl, Heike; Lapshinova-Koltunski, Ekaterina

Information Theory in Translation and Interpreting Studies Incollection

Reference Module in Social Sciences, Elsevier, 2026, ISBN 978-0-443-15785-1.

This article deals with the application of information theory in empirical, quantitative translation and interpreting studies. In particular, we focus on translationese, i.e., the specific linguistic choices in translation compared to original, non-mediated communication. We show how Shannon’s information theory can be used to (i) explore translationese effects in corpora, (ii) test specific translationese hypotheses (e.g., simplification, explicitation), and (iii) link up with explanations from rational communication, such as processing efficiency and cognitive resource limitations.

@incollection{TEICH2026,
title = {Information Theory in Translation and Interpreting Studies},
author = {Elke Teich and Heike Przybyl and Ekaterina Lapshinova-Koltunski},
url = {https://www.sciencedirect.com/science/article/pii/B9780323955041015830},
doi = {https://doi.org/10.1016/B978-0-323-95504-1.01583-0},
year = {2026},
date = {2026},
booktitle = {Reference Module in Social Sciences},
isbn = {978-0-443-15785-1},
publisher = {Elsevier},
abstract = {This article deals with the application of information theory in empirical, quantitative translation and interpreting studies. In particular, we focus on translationese, i.e., the specific linguistic choices in translation compared to original, non-mediated communication. We show how Shannon's information theory can be used to (i) explore translationese effects in corpora, (ii) test specific translationese hypotheses (e.g., simplification, explicitation), and (iii) link up with explanations from rational communication, such as processing efficiency and cognitive resource limitations.},
pubstate = {published},
type = {incollection}
}

Project:   B7

Steuer, Julius; Nakai, Toshiki; Dyer, Andrew; Talamo, Luigi; Verkerk, Annemarie

Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus Inproceedings

Vylomova, Ekaterina; Shcherbakov, Andrei; Rani, Priya (Ed.): Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 18-25, Rabat, Morocco, 2026, ISBN 979-8-89176-374-6.

The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk & Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.

@inproceedings{steuer-etal-2026-evaluating,
title = {Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus},
author = {Julius Steuer and Toshiki Nakai and Andrew Dyer and Luigi Talamo and Annemarie Verkerk},
editor = {Ekaterina Vylomova and Andrei Shcherbakov and Priya Rani},
url = {https://aclanthology.org/2026.sigtyp-main.3/},
doi = {https://doi.org/10.18653/v1/2026.sigtyp-main.3},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
isbn = {979-8-89176-374-6},
pages = {18-25},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk \& Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7
