Two new German corpora and their use: (1) A German Parallel Clausal Coordinate Ellipsis Corpus, and (2) A Treebank of Leichte Sprache texts

Abstract

The presentation is divided into two parts:

(1) First, we present a new German resource for coordinated sentences, including asyndetic coordinations (joint work with Denis Memmesheimer). It aligns sentences from TüBa-D/Z that exhibit Clausal Coordinate Ellipsis (CCE) with their ellipsis-reconstructed counterparts, the so-called canonical forms. CCE comprises Gapping (including Long-Distance Gapping, Sub-Gapping, and Stripping), Forward and Backward Conjunction Reduction (FCR and BCR, respectively), and Subject Gap in clauses with Finite/Fronted verb (SGF). Under certain conditions, it omits constituents or individual words in one of the conjuncts. Corpus studies confirm that several elision phenomena can occur simultaneously. Consider the canonical form of a sentence from TIGER, where the reconstructed elements are marked with subscripts (‘b’ for BCR and ‘g’ for Gapping): Monopole sollen geknackt werden_b und Märkte sollen_g getrennt werden. ‘Monopolies should be broken and markets should be divided.’ Interestingly, this sentence could alternatively be realized with Long-Distance Gapping (LDG), although LDG is rare in the corpus material.
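The alignment between an elliptical sentence and its canonical form can be pictured with a small sketch. The representation below is purely hypothetical (the abstract does not describe the corpus file format): each token of the canonical form carries an optional marker naming the elision type that removes it, and the elliptical surface form is obtained by dropping the marked tokens. The names `Token`, `elided_by`, `canonical_form`, and `elliptical_form` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    """One token of a canonical form (hypothetical representation)."""
    form: str
    # Elision marker, e.g. "b" for BCR or "g" for Gapping; None if the token is overt.
    elided_by: Optional[str] = None


def canonical_form(tokens: List[Token]) -> str:
    """Full, ellipsis-reconstructed sentence: all tokens are kept."""
    return " ".join(t.form for t in tokens)


def elliptical_form(tokens: List[Token]) -> str:
    """Surface sentence: tokens marked as elided are dropped."""
    return " ".join(t.form for t in tokens if t.elided_by is None)


# The TIGER example from the text, with the two reconstructed tokens marked.
example = [
    Token("Monopole"), Token("sollen"), Token("geknackt"), Token("werden", "b"),
    Token("und"),
    Token("Märkte"), Token("sollen", "g"), Token("getrennt"), Token("werden"),
]

print(canonical_form(example))
print(elliptical_form(example))
```

Under this assumed encoding, a single annotated token list yields both sides of a parallel corpus entry, and the markers record which CCE type licenses each omission.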

We outline the use of the corpus for the evaluation of OPIELLE, a system that takes the chart data structure of a PCFG parser as input and produces the canonical form of the input sentence. Currently, OPIELLE achieves a BLEU score of 0.928, whereas even state-of-the-art constituency parsers have difficulties with CCE sentences. Although such sentences are well represented in both written and spoken corpora, they are often among those with the lowest F1 scores. Our new parallel corpus is designed to support the development of machine learning models and natural language processing components that automatically reconstruct CCE phenomena.
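To make the evaluation setup concrete, the following is a minimal sketch of a corpus-level BLEU computation that compares system-produced canonical forms with gold canonical forms. It assumes whitespace tokenization, one reference per sentence, uniform weights over 1- to 4-grams, and no smoothing; the abstract does not state which BLEU variant or tooling was actually used, so the function `corpus_bleu` is an illustration, not the authors' evaluation script.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # hypothesis n-gram counts, summed over the corpus
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


gold = ["Monopole sollen geknackt werden und Märkte sollen getrennt werden"]
perfect = list(gold)                                                  # full reconstruction
unreconstructed = ["Monopole sollen geknackt und Märkte getrennt werden"]  # ellipsis left as is

print(corpus_bleu(perfect, gold))
print(corpus_bleu(unreconstructed, gold))
```

Because the score is unsmoothed, it is only informative when aggregated over many sentence pairs; on a single short sentence, one missing reconstruction can already remove every matching 4-gram.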

(2) In the second part of the presentation, we outline a treebank of Leichte Sprache (LS; easy-to-read German) texts (joint work with Ina Steinmetz). Leichte Sprache is a variety of German characterized by simplified syntactic constructions and a small vocabulary. So far, LS is mainly provided for, but rarely written by, its target group, which includes low-literate people with intellectual or developmental disabilities (IDD) and/or complex communication needs (CCN). We call them ‘the users’ here. They evaluated the LS text for ease of understanding. The use and production of LS text are therefore asymmetrical: in general, LS readers do not actively contribute to the written discourse themselves. Under this premise, we compiled a collection in which 50% of the texts were proofread by the target group and 50% were not. The whole collection, comprising 29,170 sentences, was automatically parsed with the dependency parser PARZU.

We use the treebank to define the syntactic grammar of Extended LS (ELS), which the users would like to use to write down their own thoughts. This research question is driven by an observation from the LeiSA project in Leipzig: a number of constructions were judged to be easy to understand (and, presumably, easy to produce) but fell outside the definition of LS. In addition, it is often difficult to reconstruct the coherence between consecutive LS sentences with mandatory SVO word order; in non-LS text, this coherence is realized, for example, by the conjunctions that introduce subordinate clauses. ELS therefore suggests the systematic use of rhetorical relations between sentences. The appropriateness of the ELS constructions provided in EASYTAK, an ELS writing system, is illustrated by a recent usability study.