Publications

Ortmann, Katrin; Voigtmann, Sophia; Dipper, Stefanie; Speyer, Augustin

An information-theoretic account of constituent order in the German middle field Book Chapter

Lemke, Tyll Robin; Schäfer, Lisa; Reich, Ingo;  (Ed.): Information structure and information theory, Language Science Press, pp. 55–86, Berlin, 2024.

This paper proposes a novel approach to explain object order in German. Although the order of constituents is relatively free in modern German, there are clear preferences for the order dative before accusative (nominal) objects and for the order given before new objects. A range of influential factors have been described in the literature, most prominently givenness and length. We assume processing-related reasons and use information-theoretic measures, in particular surprisal and DORM (Cuskley et al. 2021), to explore the interplay of information structure and information density as factors for object order. We propose a measure called DORMdiff and the corpus of variants method for comparing information profiles between different plausible constituent orders. Our investigations show that language users follow information-theoretic principles (UID, Levy & Jaeger 2007) in choosing the object order that leads to a more uniform distribution of information. We argue that this preference also explains deviations from the unmarked object order (i.e., accusative preceding dative and new preceding given) if it is associated with smoother information profiles.
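
A minimal Python sketch of the UID intuition invoked above (not the paper's actual DORM or DORMdiff computation): per-word surprisal profiles are compared across two competing orders, and the flatter profile, i.e. the one with lower variance, counts as the more uniform distribution of information. All probabilities are invented for illustration.

import math
import statistics

def surprisal(p):
    # Surprisal in bits: S(w) = -log2 P(w | context)
    return -math.log2(p)

# Hypothetical conditional word probabilities for the same clause in two orders
order_a = [0.20, 0.15, 0.12, 0.18]  # e.g. dative before accusative
order_b = [0.20, 0.02, 0.45, 0.18]  # e.g. accusative before dative

for label, probs in (("order A", order_a), ("order B", order_b)):
    profile = [surprisal(p) for p in probs]
    # Under UID, the order with the lower variance would be preferred
    print(label, round(statistics.variance(profile), 2))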

@inbook{Ortmann-etal-2024,
title = {An information-theoretic account of constituent order in the German middle field},
author = {Katrin Ortmann and Sophia Voigtmann and Stefanie Dipper and Augustin Speyer},
editor = {Tyll Robin Lemke and Lisa Sch{\"a}fer and Ingo Reich},
url = {https://langsci-press.org/catalog/book/465},
doi = {https://doi.org/10.5281/zenodo.13383787},
year = {2024},
date = {2024},
booktitle = {Information structure and information theory},
pages = {55–86},
publisher = {Language Science Press},
address = {Berlin},
abstract = {This paper proposes a novel approach to explain object order in German. Although the order of constituents is relatively free in modern German, there are clear preferences for the order dative before accusative (nominal) objects and for the order given before new objects. A range of influential factors have been described in the literature, most prominently givenness and length. We assume processing-related reasons and use information-theoretic measures, in particular surprisal and DORM (Cuskley et al. 2021), to explore the interplay of information structure and information density as factors for object order. We propose a measure called DORMdiff and the corpus of variants method for comparing information profiles between different plausible constituent orders. Our investigations show that language users follow information-theoretic principles (UID, Levy & Jaeger 2007) in choosing the object order that leads to a more uniform distribution of information. We argue that this preference also explains deviations from the unmarked object order (i.e., accusative preceding dative and new preceding given) if it is associated with smoother information profiles.},
pubstate = {published},
type = {inbook}
}

Project:   C6

Voigtmann, Sophia

Wie Informationsdichte Extraposition beeinflusst. Eine Korpusuntersuchung an wissenschaftlichen Texten des frühen Neuhochdeutschen PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbrücken, Germany, 2024.

This thesis presents a corpus study of the placement of noun phrases, prepositional phrases, and relative clauses in the post-field of scientific texts from the period 1650 to 1900, relating extraposition to processing. The post-field is assumed to offer processing advantages: once the right sentence bracket has been processed, all obligatory arguments of the clause are either known or can be predicted with (greater) certainty, so that more cognitive capacity is free for processing lexical information. To operationalize this expectation-based processing in a historical setting, surprisal in the sense of Shannon (1948), Hale (2001), and Levy (2008) is used. Since previous research associates extraposition primarily with length, a memory-based processing account is also taken into consideration, and the study further examines whether extraposition is influenced by a text's conceptual orality (cf. Koch & Österreicher 2007, Ortmann & Dipper 2024) as well as changes within the period under investigation. This yields three hypotheses: 1) Relative clauses and noun and prepositional phrases with high surprisal values are extraposed. 2a) Extraposition is used more frequently in conceptually oral texts. 2b) In texts closer to conceptual orality, the influence of high surprisal values is greater than in conceptually literate texts. 3) Over time, the influence of information density on extraposition decreases. To test these hypotheses, a corpus of medical and theological texts was compiled from the Deutsches Textarchiv (DTA, BBAW 2019). In it, all extraposed noun and prepositional phrases with counterparts (so-called minimal pairs), all adjacent and extraposed relative clauses, the sentence brackets, and, where present, the antecedents were annotated manually. Lemma-based skipgram values per 50-year period were computed with the tool of Kusmirek et al. (2023), and from these values the "average surprisal" of the embedded or extraposed (sub-)constituents was calculated. The Orality Score, an automatically computed measure of a text's proximity to conceptual orality, was obtained with the COAST tool (Ortmann & Dipper 2022, 2024), and the length of each constituent was determined as well. Overall, surprisal values turn out to predict above all the position of noun phrases, which is explained by their more varied functions compared to prepositional phrases and attributive relative clauses; for the other two phenomena, length plays a greater role. There are also differences between the two genres, which are linked to the content of the texts, the writing practices of the respective groups of authors, and developments within the two disciplines. The theological texts under investigation are, moreover, closer to conceptual orality than the medical texts, but both genres become more literate over the period studied, which suggests a convergence of the two writing styles. The connection between conceptual orality and extraposition can be confirmed only for noun phrases.
When the corpus is divided into conceptually oral and conceptually literate texts, surprisal values explain extraposition mainly in the conceptually oral texts. With regard to the third hypothesis, the importance of length is shown to exceed that of surprisal in the more recent texts; this is attributed to habituation to shorter sentence frames and to the increasingly professional writing practice of theologians and physicians. Besides the differences between genres and registers, the thesis above all highlights the importance of the sentence bracket for processing.
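
In symbols (our notation, not necessarily the thesis's own), the per-word surprisal and the "average surprisal" of a constituent K described above can be written as

S(w) = -\log_2 P(w \mid \mathrm{context}), \qquad \bar{S}(K) = \frac{1}{|K|} \sum_{w \in K} S(w)

where the conditional probabilities come from the lemma-based skipgram model estimated for the relevant 50-year period.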

@phdthesis{Voigtmann_Diss_2024,
title = {Wie Informationsdichte Extraposition beeinflusst. Eine Korpusuntersuchung an wissenschaftlichen Texten des fr{\"u}hen Neuhochdeutschen},
author = {Sophia Voigtmann},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/37369},
doi = {https://doi.org/10.22028/D291-41751},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {Die vorliegende Arbeit untersucht die Nachfeldstellung von Nominalphrasen, Pr{\"a}positionalphrasen und Relativs{\"a}tzen in wissenschaftlichen Texten des Zeitraums von 1650 bis 1900 mit einer Korpusstudie. Sie setzt Extraposition in Zusammenhang mit Verarbeitung. Dabei wird angenommen, dass das Nachfeld Vorteile f{\"u}r die Verarbeitung bietet, da hier alle notwendigen Aktanten des Satzes aufgrund der erfolgten Verarbeitung der rechten Satzklammer entweder bekannt sind oder mit (gr{\"o}{\ss}erer) Sicherheit vorhergesagt werden k{\"o}nnen. Somit ist im Nachfeld mehr kognitive Kapazit{\"a}t zur Verarbeitung lexikalischer Information frei. Um diese erwartungsbasierte Verarbeitung auch im historischen Kontext operationalisieren zu k{\"o}nnen, wird Surprisal im Sinne von Shannon (1948), Hale (2001) und Levy (2008) genutzt. Gleichzeitig ist aufgrund der bisherigen Forschung, die Extraposition vor allem mit L{\"a}nge assoziiert, auch ein ged{\"a}chtnisbasierter Verarbeitungsansatz in die Betrachtung von Extraposition eingeflossen. Au{\ss}erdem wurde untersucht, ob Extraposition von der konzeptionellen M{\"u}ndlichkeit eines Textes (vgl. Koch & {\"O}sterreicher 2007, Ortmann & Dipper 2024) beeinflusst wird. Auch Ver{\"a}nderungen innerhalb der untersuchten Periode wurden betrachtet. Daraus ergeben sich drei Hypothesen: 1) Relativs{\"a}tze und Nominal- sowie Pr{\"a}positionalphrasen mit hohen Surprisalwerten werden ausgelagert. 2a) Auslagerung wird verst{\"a}rkt in m{\"u}ndlichkeitsnahen Texten verwendet. 2b) In Texten, die m{\"u}ndlichkeitsn{\"a}her sind, ist der Einfluss von hohen Surprisalwerten gr{\"o}{\ss}er als in schriftlichkeitsnahen Texten. 3) {\"U}ber die Zeit wird der Einfluss der Informationsdichte auf Auslagerung geringer. Zur {\"U}berpr{\"u}fung dieser Hypothesen wurde ein Korpus aus medizinischen und theologischen Texten aus dem Deutschen Textarchiv (DTA, BAW 2019) gebildet. Darin wurden h{\"a}ndisch alle extraponierten Nominal- und Pr{\"a}positionalphrasen mit Gegenst{\"u}cken, sog. Minimalpaaren, sowie alle adjazenten und extraponierten Relativs{\"a}tze, die Satzklammern und gegebenenfalls Antezedenzien annotiert. Ebenso wurden die lemmabasierten Skipgramwerte pro 50-Jahresstufe {\"u}ber das Tool von Kusmirek et al. (2023) berechnet. Aus den so ermittelten Werten wurde das „durchschnittliche Surprisal“ der eingebetteten beziehungsweise extraponierten (Teil-)Konstituenten berechnet. {\"U}ber das COAST-Tool (Ortmann & Dipper 2022, 2024) wurde der Orality Score, ein automatisierter Score zur Bestimmung der M{\"u}ndlichkeitsn{\"a}he, ermittelt. Zus{\"a}tzlich wurde die L{\"a}nge f{\"u}r jede Konstituente bestimmt. Insgesamt konnte gezeigt werden, dass Surprisalwerte vor allem die Position von Nominalphrasen vorhersagen k{\"o}nnen, was mit deren vielf{\"a}ltigeren Funktionen – verglichen mit den Pr{\"a}positionalphrasen und attributiven Relativs{\"a}tzen – erkl{\"a}rt wird. Bei den beiden anderen Ph{\"a}nomenen spielt die L{\"a}nge eine gr{\"o}{\ss}ere Rolle. Des Weiteren finden sich Unterschiede zwischen den beiden Genres, die mit den Inhalten der Texte und der Schreibpraxis der jeweiligen Autorengruppen sowie Ver{\"a}nderungen in den beiden Wissenschaftsrichtungen in Zusammenhang gebracht werden. Die untersuchten theologischen Texte sind au{\ss}erdem m{\"u}ndlichkeitsn{\"a}her als die medizinischen Texte. 
Beide Genre werden {\"u}ber den untersuchten Zeitraum hinweg aber schriftlichkeitsn{\"a}her, was auch f{\"u}r eine Ann{\"a}herung beider Schreibstile zu sprechen scheint. Zudem kann der Zusammenhang zwischen M{\"u}ndlichkeitsn{\"a}he und Extraposition nur f{\"u}r Nominalphrasen best{\"a}tigt werden. Bei einer Zweiteilung des Korpus in m{\"u}ndlichkeitsnahe und schriftlichkeitsnahe Texte zeigt sich, dass die Surprisalwerte eher in den m{\"u}ndlichkeitsnahen Texten Extraposition erkl{\"a}ren k{\"o}nnen. Im Zusammenhang mit der dritten Hypothese wurde gezeigt, dass die Bedeutung der L{\"a}nge die der Surprisalwerte in j{\"u}ngeren Texten {\"u}bersteigt. Es wurde daf{\"u}r argumentiert, dass eine Gew{\"o}hnung an k{\"u}rzere Satzrahmen erfolgte und die Schreibpraxis der Theologen und Mediziner professioneller wird. Neben den Unterschieden zwischen den Genres und den Registern, stellt die Arbeit vor allem die Bedeutung der Satzklammer f{\"u}r die Verarbeitung in den Mittelpunkt.},
pubstate = {published},
type = {phdthesis}
}

Project:   C6

Dipper, Stefanie; Haiber, Cora; Schröter, Anna Maria; Wiemann, Alexandra; Brinkschulte, Maike

Universal Dependencies: Extensions for Modern and Historical German Inproceedings

Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen (Ed.): Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, pp. 17101-17111, Torino, Italia, 2024.

In this paper we present extensions of the UD scheme for modern and historical German. The extensions relate in part to fundamental differences such as those between different kinds of arguments and modifiers. We illustrate the extensions with examples from the MHG data and discuss a number of MHG-specific constructions. At the current time, we have annotated a corpus of Middle High German with almost 29K tokens using this scheme, which to our knowledge is the first UD treebank for Middle High German. Inter-annotator agreement is very high: the annotators achieve a score of α = 0.85. A statistical analysis of the annotations shows some interesting differences in the distribution of labels between modern and historical German.
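
For readers who want to reproduce such an agreement figure: assuming the reported α is a Krippendorff-style coefficient, it can be computed, for example, with NLTK. The annotations below are invented; this is not the paper's evaluation setup.

from nltk.metrics.agreement import AnnotationTask

# Triples of (annotator, item, label) with one disagreement on tok2
data = [("a1", "tok1", "nsubj"), ("a2", "tok1", "nsubj"),
        ("a1", "tok2", "obj"),   ("a2", "tok2", "obl"),
        ("a1", "tok3", "det"),   ("a2", "tok3", "det")]
print(round(AnnotationTask(data=data).alpha(), 2))  # Krippendorff's alpha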

@inproceedings{dipper-etal-2024-universal-dependencies,
title = {Universal Dependencies: Extensions for Modern and Historical German},
author = {Stefanie Dipper and Cora Haiber and Anna Maria Schr{\"o}ter and Alexandra Wiemann and Maike Brinkschulte},
editor = {Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue},
url = {https://aclanthology.org/2024.lrec-main.1485},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages = {17101-17111},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {In this paper we present extensions of the UD scheme for modern and historical German. The extensions relate in part to fundamental differences such as those between different kinds of arguments and modifiers. We illustrate the extensions with examples from the MHG data and discuss a number of MHG-specific constructions. At the current time, we have annotated a corpus of Middle High German with almost 29K tokens using this scheme, which to our knowledge is the first UD treebank for Middle High German. Inter-annotator agreement is very high: the annotators achieve a score of α = 0.85. A statistical analysis of the annotations shows some interesting differences in the distribution of labels between modern and historical German.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Ortmann, Katrin; Dipper, Stefanie

Nähetexte automatisch erkennen: Entwicklung eines linguistischen Scores für konzeptionelle Mündlichkeit in historischen Texten. Book Chapter

Imo, Wolfgang; Wesche, Jörg (Ed.): Sprechen und Gespräch in historischer Perspektive: Sprach-und literaturwissenschaftliche Zugänge, Metzler, pp. 17-36, Berlin, Heidelberg, 2024.

This chapter presents an automatically computable score for estimating the conceptual orality of a historical text. The score is based on a set of linguistic features such as average word length, the frequency of first-person personal pronouns, the ratio of full verbs to nouns, and the proportion of content words in the text. These features are weighted differently in the computation of the orality score; the weights were determined using the Kasseler Junktionskorpus (Ágel and Hennig 2008), whose texts had been assigned orality ratings by experts. In a 5-fold cross-validation, the automatically determined orality score correlates very strongly with the expert score (r = 0.9175).
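
A minimal sketch of such a weighted score (feature names, values, and weights are invented placeholders; the actual weights were fit on the Kasseler Junktionskorpus):

def orality_score(features, weights):
    # Weighted linear combination of normalized linguistic features
    return sum(weights[name] * value for name, value in features.items())

weights = {"avg_word_length": -0.4, "pron_1st_freq": 0.3,
           "verb_noun_ratio": 0.2, "content_word_ratio": -0.1}
text = {"avg_word_length": 0.62, "pron_1st_freq": 0.18,
        "verb_noun_ratio": 0.55, "content_word_ratio": 0.47}
print(round(orality_score(text, weights), 3))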

@inbook{Ortmann_Dipper_2024,
title = {N{\"a}hetexte automatisch erkennen: Entwicklung eines linguistischen Scores f{\"u}r konzeptionelle M{\"u}ndlichkeit in historischen Texten.},
author = {Katrin Ortmann and Stefanie Dipper},
editor = {Wolfgang Imo and J{\"o}rg Wesche},
url = {https://link.springer.com/chapter/10.1007/978-3-662-67677-6_2},
year = {2024},
date = {2024},
booktitle = {Sprechen und Gespr{\"a}ch in historischer Perspektive: Sprach-und literaturwissenschaftliche Zug{\"a}nge},
pages = {17-36},
publisher = {Metzler},
address = {Berlin, Heidelberg},
abstract = {

Dieser Beitrag stellt einen automatisch bestimmbaren Score zur Einsch{\"a}tzung der konzeptionellen M{\"u}ndlichkeit eines historischen Textes vor. Der Score basiert auf einer Reihe von linguistischen Merkmalen wie durchschnittlicher Wortl{\"a}nge, H{\"a}ufigkeit von Personalpronomen der 1.Person, Verh{\"a}ltnis Vollverben zu Nomen oder dem Anteil von Inhaltsw{\"o}rtern am Gesamttext. Diese Merkmale werden bei der Berechnung des M{\"u}ndlichkeits-Scores unterschiedlich gewichtet. Die Gewichte wurden mit Hilfe des Kasseler Junktionskorpus ({\'A}gel und Hennig 2008) festgelegt, dessen Texte von Expert/innen mit N{\"a}hewerten versehen wurden. In einer 5-fachen Kreuzvalidierung zeigt sich,dass der automatisch bestimmte M{\"u}ndlichkeits-Score in einem sehr hohen Ma{\ss} mit dem Experten-Score korreliert (r = 0.9175).
},
pubstate = {published},
type = {inbook}
}

Project:   C6

Voigtmann, Sophia; Speyer, Augustin

Where to place a phrase? Journal Article

Journal of Historical Syntax, 7 (6-19), Proceedings of the 22nd Diachronic Generative Syntax (DiGS) Conference, 2023.

In the following paper, we aim to cast light on the placement of prepositional phrases (PPs) in the so-called postfield, the position behind the right sentence bracket. Our focus is on the period of early New High German from 1650 to 1900. In a first step, extraposition will be correlated with Information Density (’ID’, Shannon 1948). ID is defined as “amount of information per unit comprising the utterance” (Levy & Jaeger 2007: 1). It can be calculated as surprisal. The higher the surprisal values, the higher the impact on working memory and the more likely perceiving difficulties become (e.g. Hale 2001). We expect PPs with such high surprisal values to be more likely to be placed in the postfield, where more memory capacities are available than in the middle field. We test this hypothesis on a corpus of scientific articles and monographs dealing with medicine and theology, taken from the Deutsches Textarchiv (DTA, BBAW 2019). We only find evidence for the hypothesis in the timespan from 1650 to 1700 and for the rare case that attributive PPs are placed in the postfield. Since this has already been shown for attributive relative clauses (Voigtmann & Speyer 2021), we take this up and argue for a similar generative analysis for attributive PPs and relative clauses in a second step.

@article{voigtmann_speyer_2023,
title = {Where to place a phrase?},
author = {Sophia Voigtmann and Augustin Speyer},
url = {https://doi.org/10.18148/HS/2023.V7I6-19.151},
year = {2023},
date = {2023},
journal = {Journal of Historical Syntax},
publisher = {Proceedings of the 22nd Diachronic Generative Syntax (DiGS) Conference},
volume = {7},
number = {6-19},
abstract = {

In the following paper, we aim to cast light on the placement of prepositional phrases (PPs) in the so-called postfield, the position behind the right sentence bracket. Our focus is on the period of early New High German from 1650 to 1900. In a first step, extraposition will be correlated with Information Density (’ID’, Shannon 1948). ID is defined as “amount of information per unit comprising the utterance” (Levy & Jaeger 2007: 1). It can be calculated as surprisal. The higher the surprisal values the higher the impact on working memory and the more likely perceiving difficulties become (e.g. Hale 2001). We expect PP with such high surprisal values to be more likely to be placed in the postfield where more memory capacities are available than in the middle field. We test this hypothesis on a corpus of scientific articles and monographs dealing with medicine and theology and taken from the Deutsches Textarchiv (DTA, BBAW 2019). We only find evidence for the hypothesis in the timespan from 1650 to 1700 and for the rare case that attributive PPs are placed in the postfield. Since this has already been shown for attributive relative clauses (Voigtmann & Speyer 2021), we want to take this up and argue for a similar generative analysis for attributive PP and relative clauses in a second step.
},
pubstate = {published},
type = {article}
}

Project:   C6

Ortmann, Katrin

Computational Methods for Investigating Syntactic Change: Automatic Identification of Extraposition in Modern and Historical German PhD Thesis

Bochumer Linguistische Arbeitsberichte (BLA) 25, 2023.

The linguistic analysis of historical German and diachronic syntactic change is traditionally based on small, manually annotated data sets. As a consequence, such studies lack the generalizability and statistical significance that quantitative approaches can offer. In this thesis, computational methods for the automatic syntactic analysis of modern and historical German are developed, which help to overcome the natural limits of manual annotation and enable the creation of large annotated data sets. The main goal of the thesis is to identify extraposition in modern and historical German, with extraposition being defined as the movement of constituents from their base position to the post-field of the sentence (Höhle 2019; Wöllstein 2018). For the automatic recognition of extraposition, two annotation steps are combined: (i) a topological field analysis for the identification of post-fields and (ii) a constituency analysis to recognize candidates for extraposition. The thesis describes experiments on topological field parsing (Ortmann 2020), chunking (Ortmann 2021a), and constituency parsing (Ortmann 2021b). The best results are achieved with statistical models trained on Part-of-Speech tags as input. Contrary to previous studies, all annotation steps are thoroughly evaluated with the newly developed FairEval method for the fine-grained error analysis and fair evaluation of labeled spans (Ortmann 2022). In an example analysis, the created methods are applied to large collections of modern and historical text to explore different factors for the extraposition of relative clauses, demonstrating the practical value of computational approaches for linguistic studies. The developed methods are released as the CLASSIG pipeline (Computational Linguistic Analysis of Syntactic Structures In German) at https://github.com/rubcompling/classig-pipeline. Data sets, models, and evaluation results are provided for download at https://github.com/rubcompling/classig-data and https://doi.org/10.5281/zenodo.7180973.
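
The two-step recognition idea can be sketched as follows (a simplification with token offsets, not the CLASSIG code itself): topological fields yield post-field spans, and constituents falling inside such a span are extraposition candidates.

def extraposition_candidates(fields, constituents):
    # Post-field ("NF") spans from the topological field analysis
    nf_spans = [(s, e) for label, s, e in fields if label == "NF"]
    # Constituents lying entirely inside a post-field are candidates
    return [c for c in constituents
            if any(s <= c[1] and c[2] <= e for s, e in nf_spans)]

fields = [("VF", 0, 1), ("LK", 1, 2), ("MF", 2, 5), ("RK", 5, 6), ("NF", 6, 10)]
constituents = [("NP", 2, 4), ("RELC", 6, 10)]
print(extraposition_candidates(fields, constituents))  # [('RELC', 6, 10)]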

@phdthesis{ortmann23,
title = {Computational Methods for Investigating Syntactic Change: Automatic Identification of Extraposition in Modern and Historical German},
author = {Katrin Ortmann},
url = {https://www.linguistics.rub.de/forschung/arbeitsberichte/25.pdf},
year = {2023},
date = {2023},
publisher = {Bochumer Linguistische Arbeitsberichte (BLA) 25},
abstract = {The linguistic analysis of historical German and diachronic syntactic change is traditionally based on small, manually annotated data sets. As a consequence, such studies lack the generalizability and statistical significance that quantitative approaches can offer. In this thesis, computational methods for the automatic syntactic analysis of modern and historical German are developed, which help to overcome the natural limits of manual annotation and enable the creation of large annotated data sets. The main goal of the thesis is to identify extraposition in modern and historical German, with extraposition being defined as the movement of constituents from their base position to the post-field of the sentence (H{\"o}hle 2019; W{\"o}llstein 2018). For the automatic recognition of extraposition, two annotation steps are combined: (i) a topological field analysis for the identification of post-fields and (ii) a constituency analysis to recognize candidates for extraposition. The thesis describes experiments on topological field parsing (Ortmann 2020), chunking (Ortmann 2021a), and constituency parsing (Ortmann 2021b). The best results are achieved with statistical models trained on Part-of-Speech tags as input. Contrary to previous studies, all annotation steps are thoroughly evaluated with the newly developed FairEval method for the fine-grained error analysis and fair evaluation of labeled spans (Ortmann 2022). In an example analysis, the created methods are applied to large collections of modern and historical text to explore different factors for the extraposition of relative clauses, demonstrating the practical value of computational approaches for linguistic studies. The developed methods are released as the CLASSIG pipeline (Computational Linguistic Analysis of Syntactic Structures In German) at https://github.com/rubcompling/classig-pipeline. Data sets, models, and evaluation results are provided for download at https://github.com/rubcompling/classig-data and https://doi.org/10.5281/zenodo.7180973.},
pubstate = {published},
type = {phdthesis}
}

Project:   C6

Ortmann, Katrin

Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans Inproceedings

Proceedings of the Language Resources and Evaluation Conference (LREC), European Language Resources Association, pp. 1400-1407, Marseille, France, 2022.

The traditional evaluation of labeled spans with precision, recall, and F1-score has undesirable effects due to double penalties. Annotations with incorrect label or boundaries count as two errors instead of one, despite being closer to the target annotation than false positives or false negatives. In this paper, new error types are introduced, which more accurately reflect true annotation quality and ensure that every annotation counts only once. An algorithm for error identification in flat and multi-level annotations is presented and complemented with a proposal on how to calculate meaningful precision, recall, and F1-scores based on the more fine-grained error types. The exemplary application to three different annotation tasks (NER, chunking, parsing) shows that the suggested procedure not only prevents double penalties but also allows for a more detailed error analysis, thereby providing more insight into the actual weaknesses of a system.
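
A toy illustration of the double-penalty problem the paper addresses (this is not the FairEval algorithm itself): under exact span matching, a prediction with the right label but a one-token boundary error is punished twice.

def exact_match_errors(gold, pred):
    # Traditional evaluation: only exact (label, start, end) matches count
    tp = gold & pred
    return len(tp), len(pred - gold), len(gold - pred)  # TP, FP, FN

gold = {("NP", 0, 2)}
pred = {("NP", 0, 3)}  # correct label, boundary off by one token
print(exact_match_errors(gold, pred))  # (0, 1, 1): one mistake, two penalties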

@inproceedings{ortmann2022,
title = {Fine-Grained Error Analysis and Fair Evaluation of Labeled Spans},
author = {Katrin Ortmann},
url = {https://aclanthology.org/2022.lrec-1.150},
year = {2022},
date = {2022-06-21},
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
pages = {1400-1407},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {The traditional evaluation of labeled spans with precision, recall, and F1-score has undesirable effects due to double penalties. Annotations with incorrect label or boundaries count as two errors instead of one, despite being closer to the target annotation than false positives or false negatives. In this paper, new error types are introduced, which more accurately reflect true annotation quality and ensure that every annotation counts only once. An algorithm for error identification in flat and multi-level annotations is presented and complemented with a proposal on how to calculate meaningful precision, recall, and F1-scores based on the more fine-grained error types. The exemplary application to three different annotation tasks (NER, chunking, parsing) shows that the suggested procedure not only prevents double penalties but also allows for a more detailed error analysis, thereby providing more insight into the actual weaknesses of a system.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Voigtmann, Sophia

Informational aspects of the extraposition of relative clauses Inproceedings

Coniglio, Marco; de Bastiani, Chiara; Catasso, Nicholas (Ed.): Language Change at the Interfaces. Intrasentential and intersentential phenomena, 2022.

One reason why intertwined clauses are harder to perceive than their counterparts might be information structure (IS). The current idea is that speakers choose extraposition of new information to help the audience perceive their text and to distribute information evenly across a discourse (Levy and Jaeger 2007). The purpose of this paper is to verify the relation between IS and extraposition or adjacency of relative clauses (RC).

@inproceedings{voigtmanninprint,
title = {Informational aspects of the extraposition of relative clauses},
author = {Sophia Voigtmann},
editor = {Marco Coniglio and Chiara de Bastiani and Nicholas Catasso},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/ag7/Voigtmann.pdf},
year = {2022},
date = {2022},
booktitle = {Language Change at the Interfaces. Intrasentential and intersentential phenomena},
abstract = {One reason why intertwined clauses are harder to perceive than their counterparts might be information structure (IS). The current idea is that speakers choose extraposition of new information to help the audience perceive their text and to distribute information evenly across a discourse (Levy and Jaeger 2007). The purpose of this paper is to verify the relation between IS and extraposition or adjacency of relative clauses (RC).},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Voigtmann, Sophia; Speyer, Augustin

Information density and the extraposition of German relative clauses Journal Article

Frontiers in Psychology, pp. 1-18, 2021.

This paper aims to find a correlation between Information Density (ID) and extraposition of Relative Clauses (RC) in Early New High German. Since surprisal is connected to perceiving difficulties, the impact on the working memory is lower for frequent combinations with low surprisal-values than it is for rare combinations with higher surprisal-values. To improve text comprehension, producers therefore distribute information as evenly as possible across a discourse. Extraposed RC are expected to have a higher surprisal-value than embedded RC. We intend to find evidence for this idea in RC taken from scientific texts from the 17th to 19th century. We built a corpus of tokenized, lemmatized and normalized papers about medicine from the 17th and 19th century, manually determined the RC-variants and calculated a skipgram-Language Model to compute the 2-Skip-bigram surprisal of every word of the relevant sentences. A logistic regression over the summed up surprisal values shows a significant result, which indicates a correlation between surprisal values and extraposition. So, for these periods it can be said that RC are more likely to be extraposed when they have a high total surprisal value. The influence of surprisal values also seems to be stable across time. The comparison of the analyzed language periods shows no significant change.
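
A simplified sketch of the recipe described above, using raw counts on a toy corpus (the study estimated lemma-based skipgram language models per period and then fed the summed surprisal values into a logistic regression):

import math
from collections import Counter

def skip_bigrams(tokens, k=2):
    # Ordered pairs (w_i, w_j) with at most k tokens skipped in between
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + k + 2, len(tokens)))]

corpus = "der arzt der den patienten heilt".split()
counts = Counter(skip_bigrams(corpus))

def surprisal(prev, word):
    # -log2 of a crudely estimated 2-skip-bigram probability
    total = sum(c for (a, _), c in counts.items() if a == prev)
    return -math.log2(counts[(prev, word)] / total)

print(round(surprisal("der", "patienten"), 2))  # about 2.58 bits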

@article{Voigtmann.Speyer,
title = {Information density and the extraposition of German relative clauses},
author = {Sophia Voigtmann and Augustin Speyer},
url = {https://doi.org/10.3389/fpsyg.2021.650969},
doi = {https://doi.org/10.3389/fpsyg.2021.650969},
year = {2021},
date = {2021-11-26},
journal = {Frontiers in Psychology},
pages = {1-18},
abstract = {This paper aims to find a correlation between Information Density (ID) and extraposition of Relative Clauses (RC) in Early New High German. Since surprisal is connected to perceiving difficulties, the impact on the working memory is lower for frequent combinations with low surprisal-values than it is for rare combinations with higher surprisal-values. To improve text comprehension, producers therefore distribute information as evenly as possible across a discourse. Extraposed RC are expected to have a higher surprisal-value than embedded RC. We intend to find evidence for this idea in RC taken from scientific texts from the 17th to 19th century. We built a corpus of tokenized, lemmatized and normalized papers about medicine from the 17th and 19th century, manually determined the RC-variants and calculated a skipgram-Language Model to compute the 2-Skip-bigram surprisal of every word of the relevant sentences. A logistic regression over the summed up surprisal values shows a significant result, which indicates a correlation between surprisal values and extraposition. So, for these periods it can be said that RC are more likely to be extraposed when they have a high total surprisal value. The influence of surprisal values also seems to be stable across time. The comparison of the analyzed language periods shows no significant change.},
pubstate = {published},
type = {article}
}

Project:   C6

Ortmann, Katrin

Automatic Phrase Recognition in Historical German Inproceedings

Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021), KONVENS 2021 Organizers, pp. 127–136, Düsseldorf, Germany, 2021.

Due to a lack of annotated data, theories of historical syntax are often based on very small, manually compiled data sets. To enable the empirical evaluation of existing hypotheses, the present study explores the automatic recognition of phrases in historical German. Using modern and historical treebanks, training data for a neural sequence labeling tool and a probabilistic parser is created, and both methods are compared on a variety of data sets. The evaluation shows that the unlexicalized parser outperforms the sequence labeling approach, achieving F1-scores of 87%–91% on modern German and between 73% and 85% on different historical corpora. An error analysis indicates that accuracy decreases especially for longer phrases, but most of the errors concern incorrect phrase boundaries, suggesting further potential for improvement.

@inproceedings{ortmann-2021b,
title = {Automatic Phrase Recognition in Historical German},
author = {Katrin Ortmann},
url = {https://aclanthology.org/2021.konvens-1.11},
year = {2021},
date = {2021-09-06},
booktitle = {Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)},
pages = {127–136},
publisher = {KONVENS 2021 Organizers},
address = {D{\"u}sseldorf, Germany},
abstract = {Due to a lack of annotated data, theories of historical syntax are often based on very small, manually compiled data sets. To enable the empirical evaluation of existing hypotheses, the present study explores the automatic recognition of phrases in historical German. Using modern and historical treebanks, training data for a neural sequence labeling tool and a probabilistic parser is created, and both methods are compared on a variety of data sets. The evaluation shows that the unlexicalized parser outperforms the sequence labeling approach, achieving F1-scores of 87%–91% on modern German and between 73% and 85% on different historical corpora. An error analysis indicates that accuracy decreases especially for longer phrases, but most of the errors concern incorrect phrase boundaries, suggesting further potential for improvement.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Voigtmann, Sophia; Speyer, Augustin

Information density as a factor for syntactic variation in Early New High German Inproceedings

Proceedings of Linguistic Evidence 2020, Tübingen, Germany, 2021.

In contrast to other languages like English, German has certain liberties in its word order. Different word orders do not influence the proposition of a sentence. The frame of the German clause is formed by the sentence brackets (the left (LSB) and the right (RSB) sentence bracket), over which the parts of the predicate are distributed in the main clause, whereas in subordinate clauses the left one can host subordinate conjunctions. But apart from the sentence brackets, the order of constituents is fairly variable, though a default word order (subject, indirect object, direct object for nouns; subject, direct object, indirect object for pronouns) exists. A deviation from this order can be caused by factors like focus, given-/newness, topicality, definiteness and animacy (Zubin & Köpcke, 1985; Reis, 1987; Müller, 1999; Lenerz, 2001 among others).

@inproceedings{voigtmannspeyerinprint,
title = {Information density as a factor for syntactic variation in Early New High German},
author = {Sophia Voigtmann and Augustin Speyer},
url = {https://ub01.uni-tuebingen.de/xmlui/handle/10900/134561},
year = {2021},
date = {2021},
booktitle = {Proceedings of Linguistic Evidence 2020},
address = {T{\"u}bingen, Germany},
abstract = {In contrast to other languages like English, German has certain liberties in its word order. Different word orders do not influence the proposition of a sentence. The frame of the German clause are the sentence brackets (the left (LSB) and the right (RSB) sentence brackets) over which the parts of the predicate are distributed in the main clause, whereas in subordinate clauses, the left one can host subordinate conjunctions. But apart from the sentence brackets, the order of constituents is fairly variable, though a default word order (subject, indirect object, direct object for nouns; subject, direct object, indirect object for pronouns) exists. A deviation of this order can be caused by factors like focus, given-/newness, topicality, definiteness and animacy (Zubin & K{\"o}pcke, 1985; Reis, 1987; M{\"u}ller, 1999; Lenerz, 2001 among others).},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Speyer, Augustin; Voigtmann, Sophia

Informationelle Bedingungen für die Selbständigkeit kausaler Satzaussagen. Eine diachrone Sichtweise Book Chapter

Külpmann, Robert; Finkbeiner, Rita (Ed.): Neues zur Selbstständigkeit von Sätzen, Linguistische Berichte, Sonderheft, Buske, pp. 177-206, Hamburg, 2021.

German offers several ways of encoding a proposition that stands in a particular logical relation to another proposition. Relevant to the topic of the workshop is the variation between independent and dependent versions, as demonstrated for a causal relation in (1).
(1) a. Uller kam früher nach Hause, weil Gwendolyn etwas mit ihm bereden wollte.
b. Uller kam früher nach Hause. (Denn) Gwendolyn wollte etwas mit ihm bereden.
('Uller came home earlier because / (for) Gwendolyn wanted to discuss something with him.')
The variation found with causal relations in particular has received a great deal of attention in the past.

@inbook{speyervoigtmann_Bedingungen,
title = {Informationelle Bedingungen f{\"u}r die Selbst{\"a}ndigkeit kausaler Satzaussagen. Eine diachrone Sichtweise},
author = {Augustin Speyer and Sophia Voigtmann},
editor = {Robert K{\"u}lpmann and Rita Finkbeiner},
url = {https://buske.de/neues-zur-selbststandigkeit-von-satzen-16620.html},
doi = {https://doi.org/10.46771/978-3-96769-170-2},
year = {2021},
date = {2021},
booktitle = {Neues zur Selbstst{\"a}ndigkeit von S{\"a}tzen},
pages = {177-206},
publisher = {Buske},
address = {Hamburg},
abstract = {Das Deutsche bietet mehrere M{\"o}glichkeiten, eine Satzaussage, die in einer bestimmten logischen Beziehung zu einer anderen steht, zu kodieren. Relevant f{\"u}r das Thema des Workshops ist die Variation zwischen selbst{\"a}ndigen und unselbst{\"a}ndigen Versionen, wie es am Beispiel einer kausalen Beziehung in (1) demonstriert ist. (1) a. Uller kam fr{\"u}her nach Hause, weil Gwendolyn etwas mit ihm bereden wollte. b. Uller kam fr{\"u}her nach Hause. (Denn) Gwendolyn wollte etwas mit ihm bereden. Gerade zur Variation bei kausalen Verh{\"a}ltnissen ist in der Vergangenheit viel gearbeitet worden.},
pubstate = {published},
type = {inbook}
}

Project:   C6

Speyer, Augustin; Voigtmann, Sophia

Factors for the integration of causal clauses in the history of German Book Chapter

Jedrzejowski, Lukasz; Fleczoreck, Constanze (Ed.): Micro- and Macro-variation of Causal Clauses: Synchronic and Diachronic Insights, John Benjamins Publishing Company, pp. 311–345, 2021.

The variation between integrated (verb-final) and independent (verb-second) causal clauses in German could depend on the amount of information conveyed in the clause. A lower amount might lead to integration, a higher amount to independence, as processing constraints might forbid the integration of highly informative clauses. We use two ways to measure the amount of information: 1. the average ratio of given referents within the clause, 2. the cumulative surprisal of all words in the clause. Focusing on historical stages of German, a significant correlation between amount of information and integration was visible, regardless of which method was used.
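
The two measures can be sketched with toy values (invented numbers, not data from the chapter):

import math

# 1. Average ratio of given referents within the clause
referents_given = [True, False, True]            # hypothetical givenness tags
given_ratio = sum(referents_given) / len(referents_given)

# 2. Cumulative surprisal of all words in the clause
word_probs = [0.10, 0.30, 0.05, 0.20]            # hypothetical P(w | context)
cumulative_surprisal = sum(-math.log2(p) for p in word_probs)

print(round(given_ratio, 2), round(cumulative_surprisal, 2))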

@inbook{speyervoigtmanninprinta,
title = {Factors for the integration of causal clauses in the history of German},
author = {Augustin Speyer and Sophia Voigtmann},
editor = {Lukasz Jedrzejowski and Constanze Fleczoreck},
url = {https://benjamins.com/catalog/slcs.231.11spe},
doi = {https://doi.org/10.1075/slcs.231.11spe},
year = {2021},
date = {2021},
booktitle = {Micro- and Macro-variation of Causal Clauses: Synchronic and Diachronic Insights},
pages = {311–345},
publisher = {John Benjamins Publishing Company},
abstract = {

The variation between integrated (verb-final) and independent (verb-second) causal clauses in German could depend on the amount of information conveyed in that clause. A lower amount might lead to integration, a higher amount to independence, as processing constraints might forbid integration of highly informative clauses. We use two ways to measure information amount: 1. the average ratio of given referents within the clause, 2. the cumulative surprisal of all words in the clause. Focusing on historical stages of German, a significant correlation between amount of information and integration was visible, regardless which method was used.

},
pubstate = {published},
type = {inbook}
}

Project:   C6

Ortmann, Katrin

Chunking Historical German Inproceedings

Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Linköping University Electronic Press, Sweden, pp. 190-199, Reykjavik, Iceland (Online), 2021.

Quantitative studies of historical syntax require large amounts of syntactically annotated data, which are rarely available. The application of NLP methods could reduce manual annotation effort, provided that they achieve sufficient levels of accuracy. The present study investigates the automatic identification of chunks in historical German texts. Because no training data exists for this task, chunks are extracted from modern and historical constituency treebanks and used to train a CRF-based neural sequence labeling tool. The evaluation shows that the neural chunker outperforms an unlexicalized baseline and achieves overall F-scores between 90% and 94% for different historical data sets when POS tags are used as features. The conducted experiments demonstrate the usefulness of including historical training data while also highlighting the importance of reducing boundary errors to improve annotation precision.
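
The general idea behind the training data creation, extracting chunk spans from constituency trees, might look like this (a simplification using NLTK, not the study's extraction code):

from nltk import Tree

t = Tree.fromstring(
    "(S (NP (ART der) (NN arzt)) (VP (VVFIN heilt) (NP (ART den) (NN patienten))))")
# Treat NP subtrees as chunk spans
chunks = [" ".join(st.leaves()) for st in t.subtrees(lambda s: s.label() == "NP")]
print(chunks)  # ['der arzt', 'den patienten']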

@inproceedings{Ortmann2021,
title = {Chunking Historical German},
author = {Katrin Ortmann},
url = {https://aclanthology.org/2021.nodalida-main.19},
year = {2021},
date = {2021},
booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)},
pages = {190-199},
publisher = {Link{\"o}ping University Electronic Press, Sweden},
address = {Reykjavik, Iceland (Online)},
abstract = {Quantitative studies of historical syntax require large amounts of syntactically annotated data, which are rarely available. The application of NLP methods could reduce manual annotation effort, provided that they achieve sufficient levels of accuracy. The present study investigates the automatic identification of chunks in historical German texts. Because no training data exists for this task, chunks are extracted from modern and historical constituency treebanks and used to train a CRF-based neural sequence labeling tool. The evaluation shows that the neural chunker outperforms an unlexicalized baseline and achieves overall F-scores between 90% and 94% for different historical data sets when POS tags are used as feature. The conducted experiments demonstrate the usefulness of including historical training data while also highlighting the importance of reducing boundary errors to improve annotation precision.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Ortmann, Katrin

Automatic Topological Field Identification in (Historical) German Texts Inproceedings

Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 10-18, Barcelona, Spain (online), 2020.

For the study of certain linguistic phenomena and their development over time, large amounts of textual data must be enriched with relevant annotations. Since the manual creation of such annotations requires a lot of effort, automating the process with NLP methods would be convenient. But the required amounts of training data are usually not available for non-standard or historical language. The present study investigates whether models trained on modern newspaper text can be used to automatically identify topological fields, i.e. syntactic structures, in different modern and historical German texts. The evaluation shows that, in general, it is possible to transfer a parser model to other registers or time periods with overall F1-scores >92%. However, an error analysis makes clear that additional rules and domain-specific training data would be beneficial if sentence structures differ significantly from the training data, e.g. in the case of Early New High German.

@inproceedings{Ortmann2020b,
title = {Automatic Topological Field Identification in (Historical) German Texts},
author = {Katrin Ortmann},
url = {https://www.aclweb.org/anthology/2020.latechclfl-1.2},
year = {2020},
date = {2020-12-12},
booktitle = {Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature},
pages = {10-18},
address = {Barcelona, Spain (online)},
abstract = {For the study of certain linguistic phenomena and their development over time, large amounts of textual data must be enriched with relevant annotations. Since the manual creation of such annotations requires a lot of effort, automating the process with NLP methods would be convenient. But the required amounts of training data are usually not available for non-standard or historical language. The present study investigates whether models trained on modern newspaper text can be used to automatically identify topological fields, i.e. syntactic structures, in different modern and historical German texts. The evaluation shows that, in general, it is possible to transfer a parser model to other registers or time periods with overall F1-scores >92%. However, an error analysis makes clear that additional rules and domain-specific training data would be beneficial if sentence structures differ significantly from the training data, e.g. in the case of Early New High German.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Ortmann, Katrin; Dipper, Stefanie

Automatic Orality Identification in Historical Texts Inproceedings

Proceedings of The 12th Language Resources and Evaluation Conference (LREC), European Language Resources Association, pp. 1293-1302, Marseille, France, 2020.

Independently of the medial representation (written/spoken), language can exhibit characteristics of conceptual orality or literacy, which mainly manifest themselves on the lexical or syntactic level. In this paper we aim at automatically identifying conceptually-oral historical texts, with the ultimate goal of gaining knowledge about spoken data of historical time stages.

We apply a set of general linguistic features that have been proven to be effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be equally useful in determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features like sentence length, particles or interjections point to peculiarities of the historical data and reveal problems with the adoption of a feature set that was developed on modern language data.

@inproceedings{Ortmann2020,
title = {Automatic Orality Identification in Historical Texts},
author = {Katrin Ortmann and Stefanie Dipper},
url = {https://www.aclweb.org/anthology/2020.lrec-1.162/},
year = {2020},
date = {2020},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
pages = {1293-1302},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {Independently of the medial representation (written/spoken), language can exhibit characteristics of conceptual orality or literacy, which mainly manifest themselves on the lexical or syntactic level. In this paper we aim at automatically identifying conceptually-oral historical texts, with the ultimate goal of gaining knowledge about spoken data of historical time stages. We apply a set of general linguistic features that have been proven to be effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be equally useful in determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features like sentence length, particles or interjections point to peculiarities of the historical data and reveal problems with the adoption of a feature set that was developed on modern language data.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Ortmann, Katrin; Roussel, Adam; Dipper, Stefanie

Evaluating Off-the-Shelf NLP Tools for German Inproceedings

Proceedings of the Conference on Natural Language Processing (KONVENS), pp. 212-222, Erlangen, Germany, 2019.

It is not always easy to keep track of what tools are currently available for a particular annotation task, nor is it obvious how the provided models will perform on a given dataset. In this contribution, we provide an overview of the tools available for the automatic annotation of German-language text. We evaluate fifteen free and open source NLP tools for the linguistic annotation of German, looking at the fundamental NLP tasks of sentence segmentation, tokenization, POS tagging, morphological analysis, lemmatization, and dependency parsing. To get an idea of how the systems’ performance will generalize to various domains, we compiled our test corpus from various non-standard domains. All of the systems in our study are evaluated not only with respect to accuracy, but also the computational resources required.

@inproceedings{Ortmann2019b,
title = {Evaluating Off-the-Shelf NLP Tools for German},
author = {Katrin Ortmann and Adam Roussel and Stefanie Dipper},
url = {https://github.com/rubcompling/konvens2019},
year = {2019},
date = {2019},
booktitle = {Proceedings of the Conference on Natural Language Processing (KONVENS)},
pages = {212-222},
address = {Erlangen, Germany},
abstract = {It is not always easy to keep track of what tools are currently available for a particular annotation task, nor is it obvious how the provided models will perform on a given dataset. In this contribution, we provide an overview of the tools available for the automatic annotation of German-language text. We evaluate fifteen free and open source NLP tools for the linguistic annotation of German, looking at the fundamental NLP tasks of sentence segmentation, tokenization, POS tagging, morphological analysis, lemmatization, and dependency parsing. To get an idea of how the systems’ performance will generalize to various domains, we compiled our test corpus from various non-standard domains. All of the systems in our study are evaluated not only with respect to accuracy, but also the computational resources required.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Ortmann, Katrin; Dipper, Stefanie

Variation between Different Discourse Types: Literate vs. Oral Inproceedings

Proceedings of the NAACL-Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Association for Computational Linguistics, pp. 64-79, Ann Arbor, Michigan, 2019.

This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers.
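
A minimal version of such a decision-tree setup, with invented feature vectors (say, average sentence length, average word length, and first-person pronoun rate), assuming scikit-learn:

from sklearn.tree import DecisionTreeClassifier

# Rows: texts; columns: avg. sentence length, avg. word length, 1st-person pronoun rate
X = [[22.1, 6.3, 0.02], [12.4, 4.8, 0.09], [25.0, 6.7, 0.01], [10.9, 4.5, 0.11]]
y = ["literate", "oral", "literate", "oral"]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict([[11.5, 4.6, 0.10]]))  # -> ['oral']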

@inproceedings{Ortmann2019,
title = {Variation between Different Discourse Types: Literate vs. Oral},
author = {Katrin Ortmann and Stefanie Dipper},
url = {https://aclanthology.org/W19-1407/},
doi = {https://doi.org/10.18653/v1/W19-1407},
year = {2019},
date = {2019-06-07},
booktitle = {In Proceedings of the NAACL-Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)},
pages = {64-79},
publisher = {Association for Computational Linguistics},
address = {Ann Arbor, Michigan},
abstract = {This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers},
pubstate = {published},
type = {inproceedings}
}

Project:   C6
