Publications - SFB 1102

Jablotschkin, Sarah; Lapshinova-Koltunski, Ekaterina; Zinsmeister, Heike

Coreference in simplified German: Linguistic features and challenges of automatic annotation Inproceedings

Ogrodniczuk, Maciej; Novak, Michal; Poesio, Massimo; Pradhan, Sameer; Ng, Vincent (Ed.): Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference, Association for Computational Linguistics, pp. 12-23, Suzhou, China, 2025.

In this paper, we analyse coreference annotation of the German language, focussing on the phenomenon of simplification, that is, the tendency to use words and constructions that are assumed to be easier perceived, understood, or produced. Simplification is one of the tools used by language users in order to optimise communication effectively. We are interested in how simplification is reflected in coreference in two different language products exposed to the phenomena of simplification: simultaneous interpreting and Easy German. For this, we automatically annotate simplified texts with coreference. We then evaluate the outputs of automatic annotation. In addition, we also look into quantitative distributions of some coreference features. Our findings show that although the language products under analysis diverge in terms of simplification driving factors, they share some specific coreference features. We also show that this specificity may cause annotation errors in simplified language, e.g. in non-nominal or split antecedents.

@inproceedings{jablotschkin-etal-2025-coreference,
title = {Coreference in simplified German: Linguistic features and challenges of automatic annotation},
author = {Sarah Jablotschkin and Ekaterina Lapshinova-Koltunski and Heike Zinsmeister},
editor = {Maciej Ogrodniczuk and Michal Novak and Massimo Poesio and Sameer Pradhan and Vincent Ng},
url = {https://aclanthology.org/2025.crac-1.2/},
doi = {10.18653/v1/2025.crac-1.2},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference},
pages = {12--23},
publisher = {Association for Computational Linguistics},
address = {Suzhou, China},
abstract = {In this paper, we analyse coreference annotation of the German language, focussing on the phenomenon of simplification, that is, the tendency to use words and constructions that are assumed to be easier perceived, understood, or produced. Simplification is one of the tools used by language users in order to optimise communication effectively. We are interested in how simplification is reflected in coreference in two different language products exposed to the phenomena of simplification: simultaneous interpreting and Easy German. For this, we automatically annotate simplified texts with coreference. We then evaluate the outputs of automatic annotation. In addition, we also look into quantitative distributions of some coreference features. Our findings show that although the language products under analysis diverge in terms of simplification driving factors, they share some specific coreference features. We also show that this specificity may cause annotation errors in simplified language, e.g. in non-nominal or split antecedents.},
pubstate = {published},
type = {inproceedings}
}

Projects: B7 T1

Jablotschkin, Sarah; Zinsmeister, Heike

Wie ist der Anfang von Sätzen in Leichter Sprache? Ergebnisse der Studie LeiKo -- Ein Vergleichs-Korpus für Leichte Sprache und Einfache Sprache Miscellaneous

Schiffler, Inga; Baier, Elke; Kölln, Marco; Schneider, Nadine; Haarkamp, Angelika (Ed.): , 2024.

This text is a summary in Leichte Sprache (Easy German).

@misc{jablotschkin_wie_2024,
title = {Wie ist der Anfang von S{\"a}tzen in Leichter Sprache? Ergebnisse der Studie LeiKo -- Ein Vergleichs-Korpus f{\"u}r Leichte Sprache und Einfache Sprache},
author = {Sarah Jablotschkin and Heike Zinsmeister},
editor = {Inga Schiffler and Elke Baier and Marco K{\"o}lln and Nadine Schneider and Angelika Haarkamp},
url = {https://www.fdr.uni-hamburg.de/record/14827},
doi = {10.25592/UHHFDM.14827},
year = {2024},
date = {2024},
abstract = {This text is a summary in Leichte Sprache (Easy German).},
pubstate = {published},
type = {miscellaneous}
}

Project: T1

Jablotschkin, Sarah; Teich, Elke; Zinsmeister, Heike

DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis Inproceedings

Raya Chakravarthi, Bharathi; B, Bharathi; Buitelaar, Paul; Durairaj, Thenmozhi; Kovács, György; Ángel García Cumbreras, Miguel (Ed.): Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, Association for Computational Linguistics, pp. 106-117, St. Julians, Malta, 2024.

In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.

https://aclanthology.org/2024.ltedi-1.9

@inproceedings{jablotschkin-etal-2024-de,
title = {DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis},
author = {Sarah Jablotschkin and Elke Teich and Heike Zinsmeister},
editor = {Bharathi Raya Chakravarthi and Bharathi B and Paul Buitelaar and Thenmozhi Durairaj and Gy{\"o}rgy Kov{\'a}cs and Miguel {\'A}ngel Garc{\'i}a Cumbreras},
url = {https://aclanthology.org/2024.ltedi-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion},
pages = {106--117},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.},
pubstate = {published},
type = {inproceedings}
}

Project: T1

Jablotschkin, Sarah; Zinsmeister, Heike

LeiKo. Ein Vergleichskorpus für Leichte Sprache und Einfache Sprache Incollection

Kupietz, Mark; Schmidt, Thomas (Ed.): Neue Entwicklungen in der Korpuslandschaft der Germanistik. Beiträge zur IDS-Methodenmesse 2022, Narr, Tübingen, 2023.

The term “easy-to-read” refers to subsystems of natural languages that result from a systematic reduction at the lexical and syntactic levels and that ensure access to written information for adults with low reading skills. German has “Leichte Sprache” (Easy German), which follows specific linguistic and typographic rules, and the less restricted “einfache Sprache” (Plain German). Both varieties are receiving increasing attention in academic and non-academic discourse, not least thanks to the UN Convention on the Rights of Persons with Disabilities (UN CRPD), which Germany ratified in 2009.

@incollection{jablotschkin_zinsmeister_2023,
title = {LeiKo. Ein Vergleichskorpus f{\"u}r Leichte Sprache und Einfache Sprache},
author = {Sarah Jablotschkin and Heike Zinsmeister},
url = {https://www.ids-mannheim.de/fileadmin/aktuell/Jahrestagungen/2022/Methodenmesse/5_Jablotschkin_Zinsmeister_LeiKo.pdf},
year = {2023},
date = {2023},
booktitle = {Neue Entwicklungen in der Korpuslandschaft der Germanistik. Beitr{\"a}ge zur IDS-Methodenmesse 2022},
editor = {Mark Kupietz and Thomas Schmidt},
publisher = {Narr},
address = {T{\"u}bingen},
abstract = {The term “easy-to-read” refers to subsystems of natural languages that result from a systematic reduction at the lexical and syntactic levels and that ensure access to written information for adults with low reading skills. German has “Leichte Sprache” (Easy German), which follows specific linguistic and typographic rules, and the less restricted “einfache Sprache” (Plain German). Both varieties are receiving increasing attention in academic and non-academic discourse, not least thanks to the UN Convention on the Rights of Persons with Disabilities (UN CRPD), which Germany ratified in 2009.},
pubstate = {published},
type = {incollection}
}

Project: T1

Jablotschkin, Sarah; Benz, Nele; Zinsmeister, Heike

Evaluation of neural coreference annotation of simplified German Conference

Poster presentation at the Computational Linguistics Poster Session of the 45th Annual Meeting of the Deutsche Gesellschaft für Sprachwissenschaft (DGfS), Köln, 2023.

This poster presents our evaluation of a neural coreference resolver (Schröder et al. 2021) on simplified German texts as well as the results of an annotation study that we conducted in order to analyse error sources.

The underlying corpus can be found on Zenodo: https://doi.org/10.5281/zenodo.3626763

@conference{jablotschkin_sarah_2023_12252,
title = {Evaluation of neural coreference annotation of simplified German},
author = {Sarah Jablotschkin and Nele Benz and Heike Zinsmeister},
url = {https://doi.org/10.25592/uhhfdm.12252},
doi = {10.25592/uhhfdm.12252},
year = {2023},
date = {2023},
booktitle = {Posterpr{\"a}sentation auf der Computational Linguistics Poster Session im Rahmen der 45. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS) in K{\"o}ln},
abstract = {This poster presents our evaluation of a neural coreference resolver (Schr{\"o}der et al. 2021) on simplified German texts as well as the results of an annotation study that we conducted in order to analyse error sources. The underlying corpus can be found on Zenodo: https://doi.org/10.5281/zenodo.3626763},
pubstate = {published},
type = {conference}
}

Project: T1