Publications

Verkerk, Annemarie; Talamo, Luigi

mini-CIEP+ : A Shareable Parallel Corpus of Prose Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 135-143, Torino, Italia, 2024.

In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.

@inproceedings{verkerk-talamo-2024-mini,
title = {mini-CIEP+ : A Shareable Parallel Corpus of Prose},
author = {Annemarie Verkerk and Luigi Talamo},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.15},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {135-143},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7

Jablotschkin, Sarah; Teich, Elke; Zinsmeister, Heike

DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis Inproceedings

Raya Chakravarthi, Bharathi; B, Bharathi; Buitelaar, Paul; Durairaj, Thenmozhi; Kovács, György; Ángel García Cumbreras, Miguel (Ed.): Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, Association for Computational Linguistics, pp. 106-117, St. Julians, Malta, 2024.

In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.

@inproceedings{jablotschkin-etal-2024-de,
title = {DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis},
author = {Sarah Jablotschkin and Elke Teich and Heike Zinsmeister},
editor = {Bharathi Raya Chakravarthi and Bharathi B and Paul Buitelaar and Thenmozhi Durairaj and Gy{\"o}rgy Kov{\'a}cs and Miguel {\'A}ngel Garc{\'i}a Cumbreras},
url = {https://aclanthology.org/2024.ltedi-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion},
pages = {106-117},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.},
pubstate = {published},
type = {inproceedings}
}

Project:   T1

Fischer, Stefan; Haidarzhyi, Kateryna; Knappen, Jörg; Polishchuk, Olha; Stodolinska, Yuliya; Teich, Elke

A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication Inproceedings

Romanyshyn, Mariana; Romanyshyn, Nataliia; Hlybovets, Andrii; Ignatenko, Oleksii (Ed.): Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, ELRA and ICCL, pp. 1-7, Torino, Italia, 2024.

We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbrücken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.

@inproceedings{fischer-etal-2024-contemporary,
title = {A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication},
author = {Stefan Fischer and Kateryna Haidarzhyi and J{\"o}rg Knappen and Olha Polishchuk and Yuliya Stodolinska and Elke Teich},
editor = {Mariana Romanyshyn and Nataliia Romanyshyn and Andrii Hlybovets and Oleksii Ignatenko},
url = {https://aclanthology.org/2024.unlp-1.1},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024},
pages = {1-7},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbr{\"u}cken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Menzel, Katrin

Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data - A Case Study on Initialisms in English and German Inproceedings

Fiser, Darja; Eskevich, Maria; Bordon, David (Ed.): Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024, ELRA and ICCL, pp. 57-65, Torino, Italia, 2024.

This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. 
Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.

@inproceedings{menzel-2024-exploring,
title = {Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data - A Case Study on Initialisms in English and German},
author = {Katrin Menzel},
editor = {Darja Fiser and Maria Eskevich and David Bordon},
url = {https://aclanthology.org/2024.parlaclarin-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024},
pages = {57-65},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. 
Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Alves, Diego; Degaetano-Ortlieb, Stefania; Schmidt, Elena; Teich, Elke

Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English Inproceedings

Bhatia, Archna; Bouma, Gosse; Seza Dogruoz, A.; Evang, Kilian; Garcia, Marcos; Giouli, Voula; Han, Lifeng; Nivre, Joakim; Rademaker, Alexandre (Ed.): Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, ELRA and ICCL, pp. 81-87, Torino, Italia, 2024.

We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.

@inproceedings{alves-etal-2024-diachronic,
title = {Diachronic Analysis of Multi-word Expression Functional Categories in Scientific English},
author = {Diego Alves and Stefania Degaetano-Ortlieb and Elena Schmidt and Elke Teich},
editor = {Archna Bhatia and Gosse Bouma and A. Seza Dogruoz and Kilian Evang and Marcos Garcia and Voula Giouli and Lifeng Han and Joakim Nivre and Alexandre Rademaker},
url = {https://aclanthology.org/2024.mwe-1.12},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024},
pages = {81-87},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {We present a diachronic analysis of multi-word expressions (MWEs) in English based on the Royal Society Corpus, a dataset containing 300+ years of the scientific publications of the Royal Society of London. Specifically, we investigate the functions of MWEs, such as stance markers (“it is interesting”) or discourse organizers (“in this section”), and their development over time. Our approach is multi-disciplinary: to detect MWEs we use Universal Dependencies, to classify them functionally we use an approach from register linguistics, and to assess their role in diachronic development we use an information-theoretic measure, relative entropy.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Bagdasarov, Sergei; Degaetano-Ortlieb, Stefania

Applying Information-theoretic Notions to Measure Effects of the Plain English Movement on English Law Reports and Scientific Articles Inproceedings

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Kazantseva, Anna; Szpakowicz, Stan (Ed.): Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), Association for Computational Linguistics, pp. 101-110, St. Julians, Malta, 2024.

We investigate the impact of the Plain English Movement (PEM) on the complexity of legal language in UK law reports from the 1950s-2010s, contrasting it with the evolution of scientific language. The PEM, emerging in the late 20th century, advocated for clear and understandable legal language. We define complexity through the concept of surprisal – an information-theoretic measure correlating with cognitive processing difficulty. Our research contrasts surprisal with traditional readability measures, which often overlook content. We hypothesize that, if the PEM has influenced legal language, there would be a reduction in complexity over time and a shift from a nominal to a more verbal style. We analyze text complexity and lexico-grammatical changes in line with PEM recommendations. Results indicate minimal impact of the PEM on both legal and scientific domains. This finding suggests future research should consider processing effort when advocating for linguistic norms to enhance accessibility.

@inproceedings{bagdasarov-degaetano-ortlieb-2024-applying,
title = {Applying Information-theoretic Notions to Measure Effects of the Plain English Movement on English Law Reports and Scientific Articles},
author = {Sergei Bagdasarov and Stefania Degaetano-Ortlieb},
editor = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Anna Kazantseva and Stan Szpakowicz},
url = {https://aclanthology.org/2024.latechclfl-1.11},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)},
pages = {101-110},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {We investigate the impact of the Plain English Movement (PEM) on the complexity of legal language in UK law reports from the 1950s-2010s, contrasting it with the evolution of scientific language. The PEM, emerging in the late 20th century, advocated for clear and understandable legal language. We define complexity through the concept of surprisal - an information-theoretic measure correlating with cognitive processing difficulty. Our research contrasts surprisal with traditional readability measures, which often overlook content. We hypothesize that, if the PEM has influenced legal language, there would be a reduction in complexity over time and a shift from a nominal to a more verbal style. We analyze text complexity and lexico-grammatical changes in line with PEM recommendations. Results indicate minimal impact of the PEM on both legal and scientific domains. This finding suggests future research should consider processing effort when advocating for linguistic norms to enhance accessibility.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kray, Jutta; Sommerfeld, Linda; Borovsky, Arielle; Häuser, Katja

The role of prediction error in the development of language learning and memory Journal Article

Child Development Perspectives, pp. 1–14, 2024.

Prediction error plays a pivotal role in theories of learning, including theories of language acquisition and use. Researchers have investigated whether and under which conditions children, like adults, use prediction to facilitate language comprehension at different levels of linguistic representation. However, many aspects of the reciprocal relation between prediction error and the development of language learning remain unclear. In this article, we review studies in language development that can inform us about the role of prediction error in updating, learning, and retrieving linguistic information. We argue that the study of individual differences in linguistic and cognitive skills will help the field understand more thoroughly whether, when, and why prediction aids language learning, and whether prediction error necessarily results in language learning and retrieval from memory. We close with a discussion of the needs and challenges for researchers to answer these questions.

@article{Kray_etal_2024,
title = {The role of prediction error in the development of language learning and memory},
author = {Jutta Kray and Linda Sommerfeld and Arielle Borovsky and Katja H{\"a}user},
url = {https://srcd.onlinelibrary.wiley.com/doi/10.1111/cdep.12515},
doi = {10.1111/cdep.12515},
year = {2024},
date = {2024},
journal = {Child Development Perspectives},
pages = {1-14},
abstract = {Prediction error plays a pivotal role in theories of learning, including theories of language acquisition and use. Researchers have investigated whether and under which conditions children, like adults, use prediction to facilitate language comprehension at different levels of linguistic representation. However, many aspects of the reciprocal relation between prediction error and the development of language learning remain unclear. In this article, we review studies in language development that can inform us about the role of prediction error in updating, learning, and retrieving linguistic information. We argue that the study of individual differences in linguistic and cognitive skills will help the field understand more thoroughly whether, when, and why prediction aids language learning, and whether prediction error necessarily results in language learning and retrieval from memory. We close with a discussion of the needs and challenges for researchers to answer these questions.},
pubstate = {published},
type = {article}
}

Project:   A5

Marchal, Marian; Scholman, Merel; Sanders, Ted J. M.; Demberg, Vera

What processing instructions do connectives provide? Modeling the facilitative effect of the connective Inproceedings

Proceedings of the Annual Meeting of the Cognitive Science Society, 46, pp. 3435-3441, 2024.

Connectives like ‘because’ are referred to as ‘processing instructions’ as they facilitate processing of linguistic material directly following the connective. In an expectation-driven account of discourse processing, this can be attributed to predictions that readers make about the upcoming discourse relation, but also to predictions about up-coming discourse content. By modeling these two accounts, termed the relation prediction account and the content prediction account respectively, we show that they make different predictions about when the presence of a connective is most beneficial. In a self-paced reading study, we replicate the facilitative effect of the connective on processing, but do not find any evidence that this effect can be explained by a strong or weak version of either of the two accounts. This suggests that the role of the connective goes above and beyond informing the reader about the upcoming relation and content and possibly triggers a different processing strategy.

@inproceedings{marchal-etal-2024,
title = {What processing instructions do connectives provide? Modeling the facilitative effect of the connective},
author = {Marian Marchal and Merel Scholman and Ted J. M. Sanders and Vera Demberg},
url = {https://escholarship.org/uc/item/2sc1k7pf},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
pages = {3435-3441},
abstract = {Connectives like ‘because’ are referred to as ‘processing instructions’ as they facilitate processing of linguistic material directly following the connective. In an expectation-driven account of discourse processing, this can be attributed to predictions that readers make about the upcoming discourse relation, but also to predictions about up-coming discourse content. By modeling these two accounts, termed the relation prediction account and the content prediction account respectively, we show that they make different predictions about when the presence of a connective is most beneficial. In a self-paced reading study, we replicate the facilitative effect of the connective on processing, but do not find any evidence that this effect can be explained by a strong or weak version of either of the two accounts. This suggests that the role of the connective goes above and beyond informing the reader about the upcoming relation and content and possibly triggers a different processing strategy.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Achimova, Asya; van Os, Marjolein; Demberg, Vera; Butz, Martin V.

Interpreting implausible event descriptions under noise Inproceedings

Proceedings of the Annual Meeting of the Cognitive Science Society, 46, pp. 3399-3406, 2024.

Gricean maxims prescribe cooperative speakers to make their utterances maximally informative so that listeners have the highest chance of understanding the utterances. At the same time, speakers are expected to save effort and not produce descriptions that are more explicit than necessary. In this work, we first ask how predictability of the described events affects the choice of anaphoric referring expressions. We show that speakers prefer phonologically overt descriptions, such as definite NPs, when they refer to agents that behave in an unexpected way. We further test how the interpretation of referring expressions changes depending on the listening conditions and prior expectations about the plausibility of an event. Our work shows that the speaker’s extra effort in choosing a more phonologically overt referring expression is justified by listeners’ behavior: they report having heard an utterance which is more plausible than the originally spoken utterance and which contains additional phonological material.

@inproceedings{Achimova-etal-2024,
title = {Interpreting implausible event descriptions under noise},
author = {Asya Achimova and Marjolein van Os and Vera Demberg and Martin V. Butz},
url = {https://escholarship.org/uc/item/13n5660h},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
pages = {3399-3406},
abstract = {Gricean maxims prescribe cooperative speakers to make their utterances maximally informative so that listeners have the highest chance of understanding the utterances. At the same time, speakers are expected to save effort and not produce descriptions that are more explicit than necessary. In this work, we first ask how predictability of the described events affects the choice of anaphoric referring expressions. We show that speakers prefer phonologically overt descriptions, such as definite NPs, when they refer to agents that behave in an unexpected way. We further test how the interpretation of referring expressions changes depending on the listening conditions and prior expectations about the plausibility of an event. Our work shows that the speaker's extra effort in choosing a more phonologically overt referring expression is justified by listeners' behavior: they report having heard an utterance which is more plausible than the originally spoken utterance and which contains additional phonological material.},
pubstate = {published},
type = {inproceedings}
}

Project:   A4

Liang, Yiming; Amsili, Pascal; Burnett, Heather; Demberg, Vera

Uniform information density explains subject doubling in French Inproceedings

Proceedings of the Annual Meeting of the Cognitive Science Society, 46, pp. 780-788, 2024.

In this paper we investigate whether subject doubling in French is affected by the Uniform Information Density (UID) principle, which states that speakers prefer language encoding that minimizes fluctuations in information density. We show that, other factors being controlled, speakers are more likely to double the NP subject when it has a high surprisal, thus providing further empirical evidence to the UID principle which predicts a surprisal-redundancy trade-off as a property of natural languages. We argue for the importance of employing GPT-2 to investigate complex linguistic phenomena such as subject doubling, as it enables the estimation of subject surprisal by considering a rather large conversational context, a task made possible by powerful language models that incorporate linguistic knowledge through pre-training on extensive datasets.

@inproceedings{Liang-etal-2024,
title = {Uniform information density explains subject doubling in French},
author = {Yiming Liang and Pascal Amsili and Heather Burnett and Vera Demberg},
url = {https://escholarship.org/uc/item/645673fs},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Annual Meeting of the Cognitive Science Society},
pages = {780-788},
abstract = {In this paper we investigate whether subject doubling in French is affected by the Uniform Information Density (UID) principle, which states that speakers prefer language encoding that minimizes fluctuations in information density. We show that, other factors being controlled, speakers are more likely to double the NP subject when it has a high surprisal, thus providing further empirical evidence to the UID principle which predicts a surprisal-redundancy trade-off as a property of natural languages. We argue for the importance of employing GPT-2 to investigate complex linguistic phenomena such as subject doubling, as it enables the estimation of subject surprisal by considering a rather large conversational context, a task made possible by powerful language models that incorporate linguistic knowledge through pre-training on extensive datasets.},
pubstate = {published},
type = {inproceedings}
}


Project:   A4

Voigtmann, Sophia

Wie Informationsdichte Extraposition beeinflusst. Eine Korpusuntersuchung an wissenschaftlichen Texten des frühen Neuhochdeutschen PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

This thesis uses a corpus study to investigate the placement of noun phrases, prepositional phrases, and relative clauses in the postfield (Nachfeld) in scientific texts from 1650 to 1900. It relates extraposition to processing. The assumption is that the postfield offers processing advantages, because by that point all required arguments of the clause are either known or can be predicted with (greater) certainty, the right sentence bracket having already been processed. More cognitive capacity is therefore available in the postfield for processing lexical information. To operationalize this expectation-based processing in a historical context as well, surprisal in the sense of Shannon (1948), Hale (2001), and Levy (2008) is used. At the same time, because previous research associates extraposition primarily with length, a memory-based processing approach is also taken into account. In addition, the thesis examines whether extraposition is influenced by the conceptual orality of a text (cf. Koch & Österreicher 2007, Ortmann & Dipper 2024). Changes within the period under investigation are also considered. This yields three hypotheses: 1) Relative clauses as well as noun and prepositional phrases with high surprisal values are extraposed. 2a) Extraposition is used more often in texts that are closer to orality. 2b) In texts that are closer to orality, the influence of high surprisal values is greater than in texts that are closer to literacy. 3) Over time, the influence of information density on extraposition decreases. To test these hypotheses, a corpus of medical and theological texts was compiled from the Deutsches Textarchiv (DTA, BAW 2019). In it, all extraposed noun and prepositional phrases together with their counterparts (so-called minimal pairs), all adjacent and extraposed relative clauses, the sentence brackets, and, where applicable, antecedents were annotated by hand. In addition, lemma-based skip-gram values were calculated per 50-year period using the tool by Kusmirek et al. (2023). From these values, the "average surprisal" of the embedded or extraposed (partial) constituents was calculated. The COAST tool (Ortmann & Dipper 2022, 2024) was used to determine the orality score, an automated score for measuring closeness to orality. The length of each constituent was also determined. Overall, the results show that surprisal values primarily predict the position of noun phrases, which is explained by their more varied functions compared with prepositional phrases and attributive relative clauses. For the other two phenomena, length plays a greater role. Furthermore, there are differences between the two genres, which are related to the content of the texts, the writing practice of the respective groups of authors, and changes in the two scientific disciplines. The theological texts examined are also closer to orality than the medical texts. Over the period examined, however, both genres move closer to literacy, which also seems to indicate a convergence of the two writing styles. Moreover, the relationship between closeness to orality and extraposition can be confirmed only for noun phrases. When the corpus is split into texts closer to orality and texts closer to literacy, surprisal values tend to explain extraposition in the texts closer to orality. With regard to the third hypothesis, it was shown that in more recent texts length matters more than surprisal values. It is argued that a habituation to shorter sentence frames took place and that the writing practice of theologians and physicians became more professional. Beyond the differences between genres and registers, the thesis foregrounds above all the importance of the sentence bracket for processing.

@phdthesis{Voigtmann_Diss_2024,
title = {Wie Informationsdichte Extraposition beeinflusst. Eine Korpusuntersuchung an wissenschaftlichen Texten des fr{\"u}hen Neuhochdeutschen},
author = {Sophia Voigtmann},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/37369},
doi = {10.22028/D291-41751},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {This thesis uses a corpus study to investigate the placement of noun phrases, prepositional phrases, and relative clauses in the postfield (Nachfeld) in scientific texts from 1650 to 1900. It relates extraposition to processing. The assumption is that the postfield offers processing advantages, because by that point all required arguments of the clause are either known or can be predicted with (greater) certainty, the right sentence bracket having already been processed. More cognitive capacity is therefore available in the postfield for processing lexical information. To operationalize this expectation-based processing in a historical context as well, surprisal in the sense of Shannon (1948), Hale (2001), and Levy (2008) is used. At the same time, because previous research associates extraposition primarily with length, a memory-based processing approach is also taken into account. In addition, the thesis examines whether extraposition is influenced by the conceptual orality of a text (cf. Koch & {\"O}sterreicher 2007, Ortmann & Dipper 2024). Changes within the period under investigation are also considered. This yields three hypotheses: 1) Relative clauses as well as noun and prepositional phrases with high surprisal values are extraposed. 2a) Extraposition is used more often in texts that are closer to orality. 2b) In texts that are closer to orality, the influence of high surprisal values is greater than in texts that are closer to literacy. 3) Over time, the influence of information density on extraposition decreases. To test these hypotheses, a corpus of medical and theological texts was compiled from the Deutsches Textarchiv (DTA, BAW 2019). In it, all extraposed noun and prepositional phrases together with their counterparts (so-called minimal pairs), all adjacent and extraposed relative clauses, the sentence brackets, and, where applicable, antecedents were annotated by hand. In addition, lemma-based skip-gram values were calculated per 50-year period using the tool by Kusmirek et al. (2023). From these values, the "average surprisal" of the embedded or extraposed (partial) constituents was calculated. The COAST tool (Ortmann & Dipper 2022, 2024) was used to determine the orality score, an automated score for measuring closeness to orality. The length of each constituent was also determined. Overall, the results show that surprisal values primarily predict the position of noun phrases, which is explained by their more varied functions compared with prepositional phrases and attributive relative clauses. For the other two phenomena, length plays a greater role. Furthermore, there are differences between the two genres, which are related to the content of the texts, the writing practice of the respective groups of authors, and changes in the two scientific disciplines. The theological texts examined are also closer to orality than the medical texts. Over the period examined, however, both genres move closer to literacy, which also seems to indicate a convergence of the two writing styles. Moreover, the relationship between closeness to orality and extraposition can be confirmed only for noun phrases. When the corpus is split into texts closer to orality and texts closer to literacy, surprisal values tend to explain extraposition in the texts closer to orality. With regard to the third hypothesis, it was shown that in more recent texts length matters more than surprisal values. It is argued that a habituation to shorter sentence frames took place and that the writing practice of theologians and physicians became more professional. Beyond the differences between genres and registers, the thesis foregrounds above all the importance of the sentence bracket for processing.},
pubstate = {published},
type = {phdthesis}
}


Project:   C6

Bourgonje, Peter; Lin, Pin-Jie

Projecting Annotations for Discourse Relations: Connective Identification for Low-Resource Languages Inproceedings

Strube, Michael; Braud, Chloe; Hardmeier, Christian; Li, Junyi Jessy; Loaiciga, Sharid; Zeldes, Amir; Li, Chuyuan (Ed.): Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024), Association for Computational Linguistics, pp. 39-49, St. Julians, Malta, 2024.

We present a pipeline for multi-lingual Shallow Discourse Parsing. The pipeline exploits Machine Translation and Word Alignment, by translating any incoming non-English input text into English, applying an English discourse parser, and projecting the found relations onto the original input text through word alignments. While the purpose of the pipeline is to provide rudimentary discourse relation annotations for low-resource languages, in order to get an idea of performance, we evaluate it on the sub-task of discourse connective identification for several languages for which gold data are available. We experiment with different setups of our modular pipeline architecture and analyze intermediate results. Our code is made available on GitHub.

@inproceedings{bourgonje-lin-2024-projecting,
title = {Projecting Annotations for Discourse Relations: Connective Identification for Low-Resource Languages},
author = {Peter Bourgonje and Pin-Jie Lin},
editor = {Michael Strube and Chloe Braud and Christian Hardmeier and Junyi Jessy Li and Sharid Loaiciga and Amir Zeldes and Chuyuan Li},
url = {https://aclanthology.org/2024.codi-1.4},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)},
pages = {39-49},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {We present a pipeline for multi-lingual Shallow Discourse Parsing. The pipeline exploits Machine Translation and Word Alignment, by translating any incoming non-English input text into English, applying an English discourse parser, and projecting the found relations onto the original input text through word alignments. While the purpose of the pipeline is to provide rudimentary discourse relation annotations for low-resource languages, in order to get an idea of performance, we evaluate it on the sub-task of discourse connective identification for several languages for which gold data are available. We experiment with different setups of our modular pipeline architecture and analyze intermediate results. Our code is made available on GitHub.},
pubstate = {published},
type = {inproceedings}
}


Project:   B2

Meßmer, Julia

A functional perspective on schema-based learning and recognition of novel word associations PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

With the current research, we sought to develop a functional perspective on schema-based learning of novel word associations, i.e., novel compound words and their later recognition. In combining the idea that both schema-based learning (e.g., Hebscher et al., 2019; van Kesteren et al., 2012) and unitization (e.g., Bader et al., 2014; Haskins et al., 2008; see Henke, 2010) might rely less on hippocampal contribution than traditional associative learning, we hypothesized that schema-congruency might support the formation of unitized representations that could then be recognized by means of an absolute familiarity process (Mecklinger & Bader, 2020). All three experiments presented include an incidental learning phase, in which novel compound words were learned together with a preceding definition that was either congruent or neutral (experimental manipulation of schema congruency). After a retention interval of about 10 minutes, a surprise memory test followed. In the test phase, participants were shown different types of compound words and instructed to classify each as intact, recombined, or new (Exp. 1), as old (intact) or new (recombined, similar lures; Exp. 3) or underwent an implicit lexical decision task (Exp. 2). Our results imply that three processes might underlie schema-based learning. Semantic priming, indicated by an N400 attenuation effect in the schema-congruent condition, establishes schema congruency. Condition-independent semantic integration of the constituents is beneficial for memory formation, as indicated by an N400 subsequent memory effect (SME). Lastly, we found a larger parietal SME in the congruent than in the neutral condition. This might reflect the formation of a conceptual (unitized) representation under the influence of a congruent schema.
Second, based on our results, schema-congruency might support the formation of unitized representations, indicated by schema-congruency being more beneficial for associative than item memory performance (see Parks & Yonelinas, 2015). The neurocognitive processes underlying recognition of those compound words might include larger absolute familiarity contributing to associative recognition in the congruent than in a neutral control condition, indicated by an N400 attenuation effect. Based on data from our third experiment including semantically similar distractors during the recognition memory test, we concluded that the representations formed under the influence of a schema might be gist-like. Those might be created next to episodic associations that are probably also formed in traditional associative learning. Lastly, those unitized memory representations formed under the influence of a schema can not only be accessed in an explicit memory test, but also affect performance in an implicit memory test.


Das Ziel der vorliegenden Arbeit war es, eine funktionelle Perspektive auf das schema-basierte Lernen neuer Wortassoziationen (Komposita) und deren späteres Wiedererkennen zu entwickeln. Dazu wurden zwei Forschungsideen zusammengeführt. Da sowohl schema-basiertes Lernen (z.B., Hebscher et al., 2019; van Kesteren et al., 2012) als auch Unitarisierung (z.B., Bader et al., 2014; Haskins et al., 2008; siehe auch Henke, 2010) weniger hippocampale Beteiligung aufweisen als traditionelles Assoziationslernen, formulierten wir die Hypothese, dass Schemakongruenz die Bildung unitarisierter Repräsentationen unterstützen könnte, die dann mittels eines absoluten Vertrautheitsprozesses wiedererkannt werden könnten (Mecklinger & Bader, 2020). Die drei Experimente, die in der vorliegenden Arbeit dargestellt sind, beinhalten alle eine inzidentelle Lernphase, in der neue Komposita zusammen mit einer kongruenten oder neutralen vorangehenden Definition gelernt wurden (experimentelle Manipulation von Schemakongruenz). Nach einem Retentionsintervall von etwa 10 Minuten folgte ein überraschender, nicht vorangekündigter Gedächtnistest. In dieser Testphase sahen die Teilnehmenden verschiedene Arten von Komposita und sollten diese als intakt, rekombiniert oder neu klassifizieren (Experiment 1), als alt (intakt) oder neu (rekombiniert, ähnliche Distraktoren; Experiment 3) oder bearbeiteten eine lexikalische Entscheidungsaufgabe (Experiment 2). Unsere Ergebnisse implizieren, dass drei Prozesse am schema-basiertem Lernen beteiligt sind. Semantisches Priming, angezeigt durch eine reduzierte N400 Amplitude in der schema-kongruenten Bedingungen, führt zu Schemakongruenz. Die bedingungsunabhängige semantische Integration der Wortbestandteile ist förderlich für die Gedächtnisbildung, indiziert durch einen N400 Subsequent Memory Effect (SME). 
Der dritte Prozess, die schemakongruenzgetriebene Bildung einer konzeptuellen (unitarisierten) Repräsentation wird angezeigt durch einen größeren parietalen SME in der kongruenten im Vergleich zur neutralen Bedingung. Basierend auf dem behavioralen Ergebnismuster, dass assoziatives Gedächtnis mehr von Schemakongruenz profitiert als Itemgedächtnis (siehe auch Parks & Yonelinas, 2015), könnte Schemakongruenz die Bildung von unitarisierten Repräsentationen fördern. Die neurokognitiven Prozesse, die dem Wiedererkennen solcher Komposita unterliegen, beinhalten wahrscheinlich einen höheren Anteil absoluter Vertrautheit in der kongruenten als in der neutralen Bedingung, indiziert durch einen entsprechenden reduzierten N400-Effekt. Basierend auf den Ergebnissen des dritten Experiments, bei dem der Rekognitionstest semantisch ähnliche Distraktoren beinhaltete, schlussfolgerten wir, dass die Repräsentationen, die unter dem Einfluss eines Schemas gebildet werden, detailarm sind und lediglich die semantische Konzeptstruktur (gist) beinhalten. Diese Repräsentationen könnten parallel zu episodischen Assoziationen geformt werden, die wahrscheinlich beim traditionellen Assoziationslernen gebildet werden. Die unitarisierten Repräsentationen konnten hierbei nicht nur in einem expliziten Gedächtnistest verwendet werden, sondern auch die Performanz in einer impliziten Gedächtnisaufgabe beeinflussen.

@phdthesis{Meßmer_Diss,
title = {A functional perspective on schema-based learning and recognition of novel word associations},
author = {Julia Me{\ss}mer},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {With the current research, we sought to develop a functional perspective on schema-based learning of novel word associations, i.e., novel compound words and their later recognition. In combining the idea that both schema-based learning (e.g., Hebscher et al., 2019; van Kesteren et al., 2012) and unitization (e.g., Bader et al., 2014; Haskins et al., 2008; see Henke, 2010) might rely less on hippocampal contribution than traditional associative learning, we hypothesized that schema-congruency might support the formation of unitized representations that could then be recognized by means of an absolute familiarity process (Mecklinger & Bader, 2020). All three experiments presented include an incidental learning phase, in which novel compound words were learned together with a preceding definition that was either congruent or neutral (experimental manipulation of schema congruency). After a retention interval of about 10 minutes, a surprise memory test followed. In the test phase, participants were shown different types of compound words and instructed to classify each as intact, recombined, or new (Exp. 1), as old (intact) or new (recombined, similar lures; Exp. 3) or underwent an implicit lexical decision task (Exp. 2). Our results imply that three processes might underlie schema-based learning. Semantic priming, indicated by an N400 attenuation effect in the schema-congruent condition, establishes schema congruency. Condition-independent semantic integration of the constituents is beneficial for memory formation, as indicated by an N400 subsequent memory effect (SME). Lastly, we found a larger parietal SME in the congruent than in the neutral condition. This might reflect the formation of a conceptual (unitized) representation under the influence of a congruent schema.
Second, based on our results, schema-congruency might support the formation of unitized representations, indicated by schema-congruency being more beneficial for associative than item memory performance (see Parks & Yonelinas, 2015). The neurocognitive processes underlying recognition of those compound words might include larger absolute familiarity contributing to associative recognition in the congruent than in a neutral control condition, indicated by an N400 attenuation effect. Based on data from our third experiment including semantically similar distractors during the recognition memory test, we concluded that the representations formed under the influence of a schema might be gist-like. Those might be created next to episodic associations that are probably also formed in traditional associative learning. Lastly, those unitized memory representations formed under the influence of a schema can not only be accessed in an explicit memory test, but also affect performance in an implicit memory test.


Das Ziel der vorliegenden Arbeit war es, eine funktionelle Perspektive auf das schema-basierte Lernen neuer Wortassoziationen (Komposita) und deren sp{\"a}teres Wiedererkennen zu entwickeln. Dazu wurden zwei Forschungsideen zusammengef{\"u}hrt. Da sowohl schema-basiertes Lernen (z.B., Hebscher et al., 2019; van Kesteren et al., 2012) als auch Unitarisierung (z.B., Bader et al., 2014; Haskins et al., 2008; siehe auch Henke, 2010) weniger hippocampale Beteiligung aufweisen als traditionelles Assoziationslernen, formulierten wir die Hypothese, dass Schemakongruenz die Bildung unitarisierter Repr{\"a}sentationen unterst{\"u}tzen k{\"o}nnte, die dann mittels eines absoluten Vertrautheitsprozesses wiedererkannt werden k{\"o}nnten (Mecklinger & Bader, 2020). Die drei Experimente, die in der vorliegenden Arbeit dargestellt sind, beinhalten alle eine inzidentelle Lernphase, in der neue Komposita zusammen mit einer kongruenten oder neutralen vorangehenden Definition gelernt wurden (experimentelle Manipulation von Schemakongruenz). Nach einem Retentionsintervall von etwa 10 Minuten folgte ein {\"u}berraschender, nicht vorangek{\"u}ndigter Ged{\"a}chtnistest. In dieser Testphase sahen die Teilnehmenden verschiedene Arten von Komposita und sollten diese als intakt, rekombiniert oder neu klassifizieren (Experiment 1), als alt (intakt) oder neu (rekombiniert, {\"a}hnliche Distraktoren; Experiment 3) oder bearbeiteten eine lexikalische Entscheidungsaufgabe (Experiment 2). Unsere Ergebnisse implizieren, dass drei Prozesse am schema-basiertem Lernen beteiligt sind. Semantisches Priming, angezeigt durch eine reduzierte N400 Amplitude in der schema-kongruenten Bedingungen, f{\"u}hrt zu Schemakongruenz. Die bedingungsunabh{\"a}ngige semantische Integration der Wortbestandteile ist f{\"o}rderlich f{\"u}r die Ged{\"a}chtnisbildung, indiziert durch einen N400 Subsequent Memory Effect (SME). 
Der dritte Prozess, die schemakongruenzgetriebene Bildung einer konzeptuellen (unitarisierten) Repr{\"a}sentation wird angezeigt durch einen gr{\"o}{\ss}eren parietalen SME in der kongruenten im Vergleich zur neutralen Bedingung. Basierend auf dem behavioralen Ergebnismuster, dass assoziatives Ged{\"a}chtnis mehr von Schemakongruenz profitiert als Itemged{\"a}chtnis (siehe auch Parks & Yonelinas, 2015), k{\"o}nnte Schemakongruenz die Bildung von unitarisierten Repr{\"a}sentationen f{\"o}rdern. Die neurokognitiven Prozesse, die dem Wiedererkennen solcher Komposita unterliegen, beinhalten wahrscheinlich einen h{\"o}heren Anteil absoluter Vertrautheit in der kongruenten als in der neutralen Bedingung, indiziert durch einen entsprechenden reduzierten N400-Effekt. Basierend auf den Ergebnissen des dritten Experiments, bei dem der Rekognitionstest semantisch {\"a}hnliche Distraktoren beinhaltete, schlussfolgerten wir, dass die Repr{\"a}sentationen, die unter dem Einfluss eines Schemas gebildet werden, detailarm sind und lediglich die semantische Konzeptstruktur (gist) beinhalten. Diese Repr{\"a}sentationen k{\"o}nnten parallel zu episodischen Assoziationen geformt werden, die wahrscheinlich beim traditionellen Assoziationslernen gebildet werden. Die unitarisierten Repr{\"a}sentationen konnten hierbei nicht nur in einem expliziten Ged{\"a}chtnistest verwendet werden, sondern auch die Performanz in einer impliziten Ged{\"a}chtnisaufgabe beeinflussen.},
pubstate = {published},
type = {phdthesis}
}


Project:   A6

Dipper, Stefanie; Haiber, Cora; Schröter, Anna Maria; Wiemann, Alexandra; Brinkschulte, Maike

Universal Dependencies: Extensions for Modern and Historical German Inproceedings

Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.): Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, pp. 17101-17111, Torino, Italia, 2024.

In this paper we present extensions of the UD scheme for modern and historical German. The extensions relate in part to fundamental differences such as those between different kinds of arguments and modifiers. We illustrate the extensions with examples from the MHG data and discuss a number of MHG-specific constructions. At the current time, we have annotated a corpus of Middle High German with almost 29K tokens using this scheme, which to our knowledge is the first UD treebank for Middle High German. Inter-annotator agreement is very high: the annotators achieve a score of α = 0.85. A statistical analysis of the annotations shows some interesting differences in the distribution of labels between modern and historical German.

@inproceedings{dipper-etal-2024-universal-dependencies,
title = {Universal Dependencies: Extensions for Modern and Historical German},
author = {Stefanie Dipper and Cora Haiber and Anna Maria Schr{\"o}ter and Alexandra Wiemann and Maike Brinkschulte},
editor = {Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue},
url = {https://aclanthology.org/2024.lrec-main.1485},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages = {17101-17111},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {In this paper we present extensions of the UD scheme for modern and historical German. The extensions relate in part to fundamental differences such as those between different kinds of arguments and modifiers. We illustrate the extensions with examples from the MHG data and discuss a number of MHG-specific constructions. At the current time, we have annotated a corpus of Middle High German with almost 29K tokens using this scheme, which to our knowledge is the first UD treebank for Middle High German. Inter-annotator agreement is very high: the annotators achieve a score of α = 0.85. A statistical analysis of the annotations shows some interesting differences in the distribution of labels between modern and historical German.},
pubstate = {published},
type = {inproceedings}
}


Project:   C6

Ortmann, Katrin; Dipper, Stefanie

Nähetexte automatisch erkennen: Entwicklung eines linguistischen Scores für konzeptionelle Mündlichkeit in historischen Texten. Book Chapter

Imo, Wolfgang; Wesche, Jörg (Ed.): Sprechen und Gespräch in historischer Perspektive: Sprach- und literaturwissenschaftliche Zugänge, Metzler, pp. 17-36, Berlin, Heidelberg, 2024.

This chapter presents an automatically computable score for assessing the conceptual orality of a historical text. The score is based on a set of linguistic features such as average word length, frequency of first-person personal pronouns, the ratio of full verbs to nouns, and the proportion of content words in the text as a whole. These features are weighted differently when the orality score is calculated. The weights were determined using the Kasseler Junktionskorpus (Ágel and Hennig 2008), whose texts were assigned proximity (Nähe) values by experts. A 5-fold cross-validation shows that the automatically determined orality score correlates very strongly with the expert score (r = 0.9175).

@inbook{Ortmann_Dipper_2024,
title = {N{\"a}hetexte automatisch erkennen: Entwicklung eines linguistischen Scores f{\"u}r konzeptionelle M{\"u}ndlichkeit in historischen Texten.},
author = {Katrin Ortmann and Stefanie Dipper},
editor = {Wolfgang Imo and J{\"o}rg Wesche},
url = {https://link.springer.com/chapter/10.1007/978-3-662-67677-6_2},
year = {2024},
date = {2024},
booktitle = {Sprechen und Gespr{\"a}ch in historischer Perspektive: Sprach- und literaturwissenschaftliche Zug{\"a}nge},
pages = {17-36},
publisher = {Metzler},
address = {Berlin, Heidelberg},
abstract = {This chapter presents an automatically computable score for assessing the conceptual orality of a historical text. The score is based on a set of linguistic features such as average word length, frequency of first-person personal pronouns, the ratio of full verbs to nouns, and the proportion of content words in the text as a whole. These features are weighted differently when the orality score is calculated. The weights were determined using the Kasseler Junktionskorpus ({\'A}gel and Hennig 2008), whose texts were assigned proximity (N{\"a}he) values by experts. A 5-fold cross-validation shows that the automatically determined orality score correlates very strongly with the expert score (r = 0.9175).},
pubstate = {published},
type = {inbook}
}


Project:   C6

Alves, Diego; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Teich, Elke

Multi-word Expressions in English Scientific Writing Inproceedings

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Kazantseva, Anna; Szpakowicz, Stan (Ed.): Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024), Association for Computational Linguistics, pp. 67-76, St. Julians, Malta, 2024.

Multi-Word Expressions (MWEs) play a pivotal role in language use overall and in register formation more specifically, e.g. encoding field-specific terminology. Our study focuses on the identification and categorization of MWEs used in scientific writing, considering their formal characteristics as well as their developmental trajectory over time from the mid-17th century to the present. For this, we develop an approach combining three different types of methods to identify MWEs (Universal Dependency annotation, Partitioner and the Academic Formulas List) and selected measures to characterize MWE properties (e.g., dispersion by Kullback-Leibler Divergence and several association measures). This allows us to inspect MWE types in a novel data-driven way regarding their functions and change over time in specialized discourse.

@inproceedings{alves-etal-2024-multi,
title = {Multi-word Expressions in English Scientific Writing},
author = {Diego Alves and Stefan Fischer and Stefania Degaetano-Ortlieb and Elke Teich},
editor = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Anna Kazantseva and Stan Szpakowicz},
url = {https://aclanthology.org/2024.latechclfl-1.8},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)},
pages = {67-76},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {Multi-Word Expressions (MWEs) play a pivotal role in language use overall and in register formation more specifically, e.g. encoding field-specific terminology. Our study focuses on the identification and categorization of MWEs used in scientific writing, considering their formal characteristics as well as their developmental trajectory over time from the mid-17th century to the present. For this, we develop an approach combining three different types of methods to identify MWEs (Universal Dependency annotation, Partitioner and the Academic Formulas List) and selected measures to characterize MWE properties (e.g., dispersion by Kullback-Leibler Divergence and several association measures). This allows us to inspect MWE types in a novel data-driven way regarding their functions and change over time in specialized discourse.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Ibrahim, Omnia; Yuen, Ivan; Xue, Wei; Andreeva, Bistra; Möbius, Bernd

Listener-oriented consequences of predictability-based acoustic adjustment Inproceedings

Baumann, Timo (Ed.): Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz (Regensburg), TUD Press, pp. 196-202, 2024, ISBN 978-3-95908-325-6.

This paper investigated whether predictability-based adjustments in production have listener-oriented consequences in perception. By manipulating the acoustic features of a target syllable in different predictability contexts in German, we tested 40 listeners’ perceptual preference for the manipulation. Four source words underwent acoustic modifications on the target syllable. Our results revealed a general preference for the original (unmodified) version over the modified one. However, listeners generally favored the unmodified version more when the source word had a more predictable context compared to a less predictable one. The results showed that predictability-based adjustments have perceptual consequences and that listeners have predictability-based expectations in perception.

@inproceedings{Ibrahim_etal_2024,
title = {Listener-oriented consequences of predictability-based acoustic adjustment},
author = {Omnia Ibrahim and Ivan Yuen and Wei Xue and Bistra Andreeva and Bernd M{\"o}bius},
editor = {Timo Baumann},
url = {https://opus4.kobv.de/opus4-oth-regensburg/frontdoor/index/index/docId/7098},
doi = {10.35096/othr/pub-7098},
year = {2024},
date = {2024},
booktitle = {Elektronische Sprachsignalverarbeitung 2024, Tagungsband der 35. Konferenz (Regensburg)},
isbn = {978-3-95908-325-6},
pages = {196-202},
publisher = {TUD Press},
abstract = {This paper investigated whether predictability-based adjustments in production have listener-oriented consequences in perception. By manipulating the acoustic features of a target syllable in different predictability contexts in German, we tested 40 listeners’ perceptual preference for the manipulation. Four source words underwent acoustic modifications on the target syllable. Our results revealed a general preference for the original (unmodified) version over the modified one. However, listeners generally favored the unmodified version more when the source word had a more predictable context compared to a less predictable one. The results showed that predictability-based adjustments have perceptual consequences and that listeners have predictability-based expectations in perception.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

Yung, Frances Pik Yu; Ahmad, Mansoor; Scholman, Merel; Demberg, Vera

Prompting Implicit Discourse Relation Annotation Inproceedings

Proceedings of Linguistic Annotation Workshop of European Chapter of the Association for Computational Linguistics, 2024.

Pre-trained large language models, such as ChatGPT, achieve outstanding performance in various reasoning tasks without supervised training and were found to have outperformed crowdsourcing workers. Nonetheless, ChatGPT’s performance in the task of implicit discourse relation classification, prompted by a standard multiple-choice question, is still far from satisfactory and considerably inferior to state-of-the-art supervised approaches. This work investigates several proven prompting techniques to improve ChatGPT’s recognition of discourse relations. In particular, we experimented with breaking down the classification task that involves numerous abstract labels into smaller subtasks. Nonetheless, experiment results show that the inference accuracy hardly changes even with sophisticated prompt engineering, suggesting that implicit discourse relation classification is not yet resolvable under zero-shot or few-shot settings.

@inproceedings{yung-etal-2024-prompting,
title = {Prompting Implicit Discourse Relation Annotation},
author = {Frances Pik Yu Yung and Mansoor Ahmad and Merel Scholman and Vera Demberg},
url = {https://arxiv.org/abs/2402.04918},
year = {2024},
date = {2024},
booktitle = {Proceedings of Linguistic Annotation Workshop of European Chapter of the Association for Computational Linguistics},
abstract = {Pre-trained large language models, such as ChatGPT, achieve outstanding performance in various reasoning tasks without supervised training and were found to have outperformed crowdsourcing workers. Nonetheless, ChatGPT's performance in the task of implicit discourse relation classification, prompted by a standard multiple-choice question, is still far from satisfactory and considerably inferior to state-of-the-art supervised approaches. This work investigates several proven prompting techniques to improve ChatGPT's recognition of discourse relations. In particular, we experimented with breaking down the classification task that involves numerous abstract labels into smaller subtasks. Nonetheless, experiment results show that the inference accuracy hardly changes even with sophisticated prompt engineering, suggesting that implicit discourse relation classification is not yet resolvable under zero-shot or few-shot settings.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Yung, Frances Pik Yu; Scholman, Merel; Zikanova, Sarka; Demberg, Vera

DiscoGeM 2.0: A parallel corpus of English, German, French and Czech implicit discourse relations Inproceedings

Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.): Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, pp. 4940-4956, Torino, Italia, 2024.

We present DiscoGeM 2.0, a crowdsourced, parallel corpus of 12,834 implicit discourse relations, with English, German, French and Czech data. We propose and validate a new single-step crowdsourcing annotation method and apply it to collect new annotations in German, French and Czech. The corpus was constructed by having crowdsourced annotators choose a suitable discourse connective for each relation from a set of unambiguous candidates. Every instance was annotated by 10 workers. Our corpus hence represents the first multi-lingual resource that contains distributions of discourse interpretations for implicit relations. The results show that the connective insertion method of discourse annotation can be reliably extended to other languages. The resulting multi-lingual annotations also reveal that implicit relations inferred in one language may differ from those inferred in the translation, meaning the annotations are not always directly transferable. DiscoGeM 2.0 promotes the investigation of cross-linguistic differences in discourse marking and could improve automatic discourse parsing applications. It is openly downloadable here: https://github.com/merelscholman/DiscoGeM.

@inproceedings{yung-etal-2024-discogem-2,
title = {DiscoGeM 2.0: A parallel corpus of English, German, French and Czech implicit discourse relations},
author = {Frances Pik Yu Yung and Merel Scholman and Sarka Zikanova and Vera Demberg},
editor = {Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue},
url = {https://aclanthology.org/2024.lrec-main.443},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages = {4940-4956},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {We present DiscoGeM 2.0, a crowdsourced, parallel corpus of 12,834 implicit discourse relations, with English, German, French and Czech data. We propose and validate a new single-step crowdsourcing annotation method and apply it to collect new annotations in German, French and Czech. The corpus was constructed by having crowdsourced annotators choose a suitable discourse connective for each relation from a set of unambiguous candidates. Every instance was annotated by 10 workers. Our corpus hence represents the first multi-lingual resource that contains distributions of discourse interpretations for implicit relations. The results show that the connective insertion method of discourse annotation can be reliably extended to other languages. The resulting multi-lingual annotations also reveal that implicit relations inferred in one language may differ from those inferred in the translation, meaning the annotations are not always directly transferable. DiscoGeM 2.0 promotes the investigation of cross-linguistic differences in discourse marking and could improve automatic discourse parsing applications. It is openly downloadable here: https://github.com/merelscholman/DiscoGeM.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Lin, Pin-Jie; Scholman, Merel; Saeed, Muhammed; Demberg, Vera

Modeling Orthographic Variation Improves NLP Performance for Nigerian Pidgin Inproceedings

Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.): Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, pp. 11510-11522, Torino, Italia, 2024.

Nigerian Pidgin is an English-derived contact language and is traditionally an oral language, spoken by approximately 100 million people. No orthographic standard has yet been adopted, and thus the few available Pidgin datasets that exist are characterised by noise in the form of orthographic variations. This contributes to under-performance of models in critical NLP tasks. The current work is the first to describe various types of orthographic variations commonly found in Nigerian Pidgin texts, and model this orthographic variation. The variations identified in the dataset form the basis of a phonetic-theoretic framework for word editing, which is used to generate orthographic variations to augment training data. We test the effect of this data augmentation on two critical NLP tasks: machine translation and sentiment analysis. The proposed variation generation framework augments the training data with new orthographic variants which are relevant for the test set but did not occur in the training set originally. Our results demonstrate the positive effect of augmenting the training data with a combination of real texts from other corpora as well as synthesized orthographic variation, resulting in performance improvements of 2.1 points in sentiment analysis and 1.4 BLEU points in translation to English.

@inproceedings{lin-etal-2024-modeling-orthographic,
title = {Modeling Orthographic Variation Improves NLP Performance for Nigerian Pidgin},
author = {Pin-Jie Lin and Merel Scholman and Muhammed Saeed and Vera Demberg},
editor = {Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue},
url = {https://aclanthology.org/2024.lrec-main.1006},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages = {11510-11522},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {Nigerian Pidgin is an English-derived contact language and is traditionally an oral language, spoken by approximately 100 million people. No orthographic standard has yet been adopted, and thus the few available Pidgin datasets that exist are characterised by noise in the form of orthographic variations. This contributes to under-performance of models in critical NLP tasks. The current work is the first to describe various types of orthographic variations commonly found in Nigerian Pidgin texts, and model this orthographic variation. The variations identified in the dataset form the basis of a phonetic-theoretic framework for word editing, which is used to generate orthographic variations to augment training data. We test the effect of this data augmentation on two critical NLP tasks: machine translation and sentiment analysis. The proposed variation generation framework augments the training data with new orthographic variants which are relevant for the test set but did not occur in the training set originally. Our results demonstrate the positive effect of augmenting the training data with a combination of real texts from other corpora as well as synthesized orthographic variation, resulting in performance improvements of 2.1 points in sentiment analysis and 1.4 BLEU points in translation to English.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2
