Jablotschkin, Sarah; Teich, Elke; Zinsmeister, Heike

DE-Lite – a New Corpus of Easy German: Compilation, Exploration, Analysis

Raya Chakravarthi, Bharathi; B, Bharathi; Buitelaar, Paul; Durairaj, Thenmozhi; Kovács, György; Ángel García Cumbreras, Miguel (Eds.): Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion, Association for Computational Linguistics, pp. 106-117, St. Julians, Malta, 2024.

In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been asked to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step towards gaining insights into these issues and to further language technology (LT) development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and for comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctiveness of the two language variants.
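
The corpus comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization in the usage note, and the add-alpha smoothing over the joint vocabulary are all assumptions made for illustration. The sketch estimates n-gram distributions for two token lists and returns the Kullback-Leibler divergence D(P || Q) in bits, together with each n-gram's individual contribution, which is one plausible way to surface features typical of one variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kl_divergence(tokens_p, tokens_q, n=1, alpha=1.0):
    """Relative entropy D(P || Q) in bits between two n-gram distributions.

    Assumption: add-alpha smoothing over the joint vocabulary, so that
    n-grams unseen in one corpus do not cause division by zero. The
    paper may use a different smoothing scheme.
    """
    counts_p = Counter(ngrams(tokens_p, n))
    counts_q = Counter(ngrams(tokens_q, n))
    vocab = set(counts_p) | set(counts_q)

    total_p = sum(counts_p.values()) + alpha * len(vocab)
    total_q = sum(counts_q.values()) + alpha * len(vocab)

    # Per-n-gram contribution p * log2(p / q); positive values mark
    # n-grams that are relatively more probable in P than in Q.
    contributions = {}
    for gram in vocab:
        p = (counts_p[gram] + alpha) / total_p
        q = (counts_q[gram] + alpha) / total_q
        contributions[gram] = p * math.log2(p / q)

    return sum(contributions.values()), contributions
```

As a usage example, calling `kl_divergence(easy_tokens, standard_tokens, n=1)` on two hypothetical token lists and sorting the returned contributions in descending order lists the unigrams that most distinguish the first corpus from the second. Because relative entropy is asymmetric, swapping the arguments generally gives a different value.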