Publications

Jablotschkin, Sarah; Teich, Elke; Zinsmeister, Heike

DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis Inproceedings

Chakravarthi, Bharathi Raja; B, Bharathi; Buitelaar, Paul; Durairaj, Thenmozhi; Kovács, György; García Cumbreras, Miguel Ángel (Eds.): Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, Association for Computational Linguistics, pp. 106-117, St. Julians, Malta, 2024.

In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.

@inproceedings{jablotschkin-etal-2024-de,
title = {DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis},
author = {Sarah Jablotschkin and Elke Teich and Heike Zinsmeister},
editor = {Bharathi Raja Chakravarthi and Bharathi B and Paul Buitelaar and Thenmozhi Durairaj and Gy{\"o}rgy Kov{\'a}cs and Miguel {\'A}ngel {Garc{\'i}a Cumbreras}},
url = {https://aclanthology.org/2024.ltedi-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion},
pages = {106--117},
publisher = {Association for Computational Linguistics},
address = {St. Julians, Malta},
abstract = {In this paper, we report on a new corpus of simplified German. It is recently requested from public agencies in Germany to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low-literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.},
pubstate = {published},
type = {inproceedings}
}

Project: T1

Jablotschkin, Sarah; Zinsmeister, Heike

LeiKo. Ein Vergleichskorpus für Leichte Sprache und Einfache Sprache Incollection

Kupietz, Mark; Schmidt, Thomas (Eds.): Neue Entwicklungen in der Korpuslandschaft der Germanistik. Beiträge zur IDS-Methodenmesse 2022, Narr, Tübingen, 2023.

The concept of “easy-to-read” refers to subsystems of natural languages that arise through systematic reduction at the lexical and syntactic levels and that ensure access to written information for adults with low reading skills. German has “Leichte Sprache” (Easy German), which follows specific linguistic and typographic rules, and the less restricted “einfache Sprache” (plain German). Both varieties are receiving increasing attention in academic and non-academic discourse, not least thanks to the UN Convention on the Rights of Persons with Disabilities (UN CRPD; German: UN-BRK), which Germany ratified in 2009.

@incollection{jablotschkin_zinsmeister_2023,
title = {LeiKo. Ein Vergleichskorpus f{\"u}r Leichte Sprache und Einfache Sprache},
author = {Sarah Jablotschkin and Heike Zinsmeister},
url = {https://www.ids-mannheim.de/fileadmin/aktuell/Jahrestagungen/2022/Methodenmesse/5_Jablotschkin_Zinsmeister_LeiKo.pdf},
year = {2023},
date = {2023},
booktitle = {Neue Entwicklungen in der Korpuslandschaft der Germanistik. Beitr{\"a}ge zur IDS-Methodenmesse 2022},
editor = {Mark Kupietz and Thomas Schmidt},
publisher = {Narr},
address = {T{\"u}bingen},
abstract = {Mit dem Konzept “Easy-to-read” werden Teilsysteme nat{\"u}rlicher Sprachen bezeichnet, welche durch eine systematische Reduktion auf den Ebenen Lexik und Syntax entstehen und den Zugang zu geschriebenen Informationen f{\"u}r Erwachsene mit geringen Lesekompetenzen gew{\"a}hrleisten. Im Deutschen gibt es “Leichte Sprache”, welche sich nach spezifischen linguistischen und typografischen Regeln richtet, und die weniger restringierte “einfache Sprache”. Beide Varianten erhalten im akademischen sowie nicht-akademischen Diskurs vermehrt Aufmerksamkeit – nicht zuletzt dank der im Jahr 2009 in Deutschland ratifizierten UN-Behindertenrechtskonvention (UN-BRK).},
pubstate = {published},
type = {incollection}
}

Project: T1

Jablotschkin, Sarah; Benz, Nele; Zinsmeister, Heike

Evaluation of neural coreference annotation of simplified German Conference

Poster presentation at the Computational Linguistics Poster Session of the 45th Annual Conference of the Deutsche Gesellschaft für Sprachwissenschaft (DGfS), Cologne, 2023.

This poster presents our evaluation of a neural coreference resolver (Schröder et al. 2021) on simplified German texts as well as the results of an annotation study that we conducted in order to analyse error sources.

The underlying corpus can be found on Zenodo: https://doi.org/10.5281/zenodo.3626763

@conference{jablotschkin_sarah_2023_12252,
title = {Evaluation of neural coreference annotation of simplified German},
author = {Sarah Jablotschkin and Nele Benz and Heike Zinsmeister},
url = {https://doi.org/10.25592/uhhfdm.12252},
doi = {10.25592/uhhfdm.12252},
year = {2023},
date = {2023},
booktitle = {Posterpr{\"a}sentation auf der Computational Linguistics Poster Session im Rahmen der 45. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS) in K{\"o}ln},
abstract = {This poster presents our evaluation of a neural coreference resolver (Schr{\"o}der et al. 2021) on simplified German texts as well as the results of an annotation study that we conducted in order to analyse error sources. The underlying corpus can be found on Zenodo: https://doi.org/10.5281/zenodo.3626763},
pubstate = {published},
type = {conference}
}

Project: T1
