Publications

Shi, Wei

Addressing the data bottleneck in implicit discourse relation classification PhD Thesis

Saarland University, Saarbruecken, Germany, 2020.

When humans comprehend language, their interpretation consists of more than just the sum of the content of the sentences. Additional logical and semantic links (known as coherence relations or discourse relations) are inferred between sentences and clauses in the text. Identifying discourse relations benefits various NLP applications such as question answering, summarization, machine translation, and information extraction. Discourse relations are categorized into implicit and explicit relations depending on whether there is an explicit discourse marker between the arguments. In this thesis, we focus mainly on implicit discourse relation classification, since explicit relations, with their markers acting as informative cues, are comparatively easy for machines to identify. Recent neural network-based approaches in particular suffer from insufficient training (and test) data.

In Chapter 3, we first show to what extent the limited data size is a problem in implicit discourse relation classification and propose data augmentation methods that draw on cross-lingual data; we then propose several approaches for better exploiting and encoding various types of existing data for the task. Most existing machine learning methods train on sections 2-21 of the PDTB and test on section 23, which contains fewer than 800 implicit discourse relation instances in total. Using cross-validation, we argue that the standard PDTB test section is too small to draw conclusions upon: with more test samples in cross-validation, we would come to very different conclusions about whether a feature is generally useful. Second, we propose a simple approach to automatically extract samples of implicit discourse relations from multilingual parallel corpora via back-translation. After back-translating from the target languages, a discourse parser can easily identify those examples that are implicit in the original but explicit in the back-translation. Adding these data to the training set yields significant improvements across different settings.

Finally, better encoding ability is also of crucial importance for improving classification performance. We propose several methods, including a sequence-to-sequence neural network and a memory component, to obtain better representations of the arguments. Using the BERT model (Devlin et al., 2019), we also show that having the correct next sentence is beneficial for the task both within and across domains. When moving to a new domain, it is beneficial to integrate external domain-specific knowledge: in Chapter 8, we show that entity enhancement significantly improves performance on BioDRB compared with other BERT-based methods. In sum, the studies reported in this dissertation contribute to addressing the data bottleneck in implicit discourse relation classification and propose approaches that achieve 54.82% and 69.57% on PDTB and BioDRB, respectively.


When humans understand language, their interpretation consists of more than just the sum of the content of the sentences. Additional logical and semantic links (so-called coherence relations or discourse relations) are inferred between the sentences of a text. Identifying discourse relations is beneficial for various NLP applications such as question answering, summarization, machine translation, information extraction, etc. Discourse relations are divided into implicit and explicit discourse relations, depending on whether there is an explicit discourse marker between the arguments. In this thesis, we concentrate mainly on the classification of implicit discourse relations, since the explicit markers serve as helpful cues and make explicit relations comparatively easy for machines to identify. Various approaches have been proposed and have achieved impressive results on implicit discourse relation classification; most of them, however, suffer from the fact that the available data are insufficient for neural network-based methods. In this thesis, we first address the problem of limited data for this task and then propose data augmentation methods based on cross-lingual data. Finally, we propose several methods for encoding the arguments better from different angles.

Most existing machine learning methods are trained on sections 2-21 of the PDTB and tested on section 23, which contains fewer than 800 implicit discourse relation instances in total. Using cross-validation, we argue that the standard test section of the PDTB is too small to draw conclusions from. With more test samples in cross-validation, we would come to different conclusions about whether a feature is generally beneficial for this task, especially when using a relatively large label set. If we rely only on our small standard test set, we risk drawing wrong conclusions about which features are helpful. Second, we propose a simple approach for automatically extracting samples of implicit discourse relations from multilingual parallel corpora via back-translation. It is motivated by the explicitation process observed when humans translate a text. After back-translation from the target languages, a discourse parser can easily identify those examples that are implicit in the original but explicit in the back-translations. With these additional data in the training set, the experiments show significant improvements in various settings. Initially, we use only French-English pairs, have no control over their quality, and concentrate mostly on intra-sentential relations. To address these issues, we later extend the idea with more preprocessing steps and more language pairs. With majority votes across different language pairs, the mapped implicit labels become more reliable.

Finally, better encoding ability is also of crucial importance for improving classification performance. We propose a new model consisting of a classifier and a sequence-to-sequence model. Besides predicting the correct label, both are trained to produce a representation of the discourse relation arguments by attempting to predict the arguments together with a suitable implicit connective. This novel secondary task forces the internal representation to encode the semantics of the relation arguments more completely and to perform a more fine-grained classification. To further capture general knowledge in contexts, we also employ a memory network to obtain an explicit context representation from training examples. For each test instance, we generate a knowledge vector by a weighted read over the memory. We evaluate the proposed model under various conditions, and the results show that the model with the memory network can facilitate the prediction of discourse relations by selecting examples with similar semantic representations and discourse relations. Even though better understanding, encoding, and semantic interpretation are essential and useful for implicit discourse relation classification, they do only part of the work. A good implicit discourse relation classifier should also be aware of upcoming events, causes, consequences, etc., in order to encode discourse expectations into the sentence representations. Using the recently proposed BERT model, we investigate whether having the correct next sentence is beneficial for the task. The experimental results show that removing the next-sentence prediction task strongly hurts performance both within and across domains. The limited ability of BioBERT to learn domain-specific knowledge, i.e., entity information, entity relations, etc., motivates us to integrate external knowledge into pre-trained language models. We propose an unsupervised method using information retrieval and knowledge graph techniques, under the assumption that if two instances share similar entities in both relational arguments, there is a high probability that they carry the same or a similar discourse relation. The approach achieves results on BioDRB comparable to baseline models. We then use the extracted relevant entities to enhance the pre-trained model K-BERT so as to encode the meaning of the arguments better, outperforming the original BERT and BioBERT by 6.5% and 2% in accuracy, respectively.

In sum, this dissertation contributes to addressing the data bottleneck problem in implicit discourse relation classification and proposes corresponding approaches in several respects, including demonstrating the limited-data problem and the risks of drawing conclusions from it; acquiring automatically annotated data through the explicitation process that occurs during human translation between English and other languages; better representations of discourse relation arguments; and entity enhancement with an unsupervised method and a pre-trained language model.
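To make the back-translation mining step concrete, here is a minimal sketch; the helper names and the connective-to-relation mapping are illustrative assumptions, not the thesis's actual pipeline, and the two translation functions stand in for any MT system:

```python
# Premise: an implicit EN relation often surfaces as an explicit connective
# after an EN -> FR -> EN round trip, so the connective found in the round
# trip can label the original implicit example.

EXPLICIT_CONNECTIVES = {           # toy mapping to PDTB-style senses
    "because": "Contingency.Cause",
    "however": "Comparison.Contrast",
    "for example": "Expansion.Instantiation",
}

def mine_implicit_examples(arg_pairs, translate_en_fr, translate_fr_en):
    """arg_pairs: (arg1, arg2) sentence pairs with no explicit connective."""
    mined = []
    for arg1, arg2 in arg_pairs:
        round_trip = translate_fr_en(translate_en_fr(arg1 + " " + arg2)).lower()
        for connective, relation in EXPLICIT_CONNECTIVES.items():
            if connective in round_trip:   # relation became explicit
                mined.append((arg1, arg2, relation))
                break
    return mined
```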

@phdthesis{Shi_Diss_2020,
title = {Addressing the data bottleneck in implicit discourse relation classification},
author = {Wei Shi},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/30143},
doi = {https://doi.org/10.22028/D291-32711},
year = {2020},
date = {2020},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {When humans comprehend language, their interpretation consists of more than just the sum of the content of the sentences. Additional logic and semantic links (known as coherence relations or discourse relations) are inferred between sentences/clauses in the text. The identification of discourse relations is beneficial for various NLP applications such as question-answering, summarization, machine translation, information extraction, etc. Discourse relations are categorized into implicit and explicit discourse relations depending on whether there is an explicit discourse marker between the arguments. In this thesis, we mainly focus on the implicit discourse relation classification, given that with the explicit markers acting as informative cues, the explicit relations are relatively easier to identify for machines. The recent neural network-based approaches in particular suffer from insufficient training (and test) data. As shown in Chapter 3 of this thesis, we start out by showing to what extent the limited data size is a problem in implicit discourse relation classification and propose data augmentation methods with the help of cross-lingual data. And then we propose several approaches for better exploiting and encoding various types of existing data in the discourse relation classification task. Most of the existing machine learning methods train on sections 2-21 of the PDTB and test on section 23, which only includes a total of less than 800 implicit discourse relation instances. With the help of cross validation, we argue that the standard test section of the PDTB is too small to draw conclusions upon. With more test samples in the cross validation, we would come to very different conclusions about whether a feature is generally useful. Second, we propose a simple approach to automatically extract samples of implicit discourse relations from multilingual parallel corpus via back-translation. After back-translating from target languages, it is easy for the discourse parser to identify those examples that are originally implicit but explicit in the back-translations. Having those additional data in the training set, the experiments show significant improvements on different settings. Finally, having better encoding ability is also of crucial importance in terms of improving classification performance. We propose different methods including a sequence-to-sequence neural network and a memory component to help have a better representation of the arguments. We also show that having the correct next sentence is beneficial for the task within and across domains, with the help of the BERT (Devlin et al., 2019) model. When it comes to a new domain, it is beneficial to integrate external domain-specific knowledge. In Chapter 8, we show that with the entity-enhancement, the performance on BioDRB is improved significantly, comparing with other BERT-based methods. In sum, the studies reported in this dissertation contribute to addressing the data bottleneck problem in implicit discourse relation classification and propose corresponding approaches that achieve 54.82% and 69.57% on PDTB and BioDRB respectively.


Wenn Menschen Sprache verstehen, besteht ihre Interpretation aus mehr als nur der Summe des Inhalts der S{\"a}tze. Zwischen S{\"a}tzen im Text werden zus{\"a}tzliche logische und semantische Verkn{\"u}pfungen (sogenannte Koh{\"a}renzrelationen oder Diskursrelationen) hergeleitet. Die Identifizierung von Diskursrelationen ist f{\"u}r verschiedene NLP-Anwendungen wie Frage- Antwort, Zusammenfassung, maschinelle {\"U}bersetzung, Informationsextraktion usw. von Vorteil. Diskursrelationen werden in implizite und explizite Diskursrelationen unterteilt, je nachdem, ob es eine explizite Diskursrelationen zwischen den Argumenten gibt. In dieser Arbeit konzentrieren wir uns haupts{\"a}chlich auf die Klassifizierung der impliziten Diskursrelationen, da die expliziten Marker als hilfreiche Hinweise dienen und die expliziten Beziehungen f{\"u}r Maschinen relativ leicht zu identifizieren sind. Es wurden verschiedene Ans{\"a}tze vorgeschlagen, die bei der impliziten Diskursrelationsklassifikation beeindruckende Ergebnisse erzielt haben. Die meisten von ihnen leiden jedoch darunter, dass die Daten f{\"u}r auf neuronalen Netzen basierende Methoden unzureichend sind. In dieser Arbeit gehen wir zun{\"a}chst auf das Problem begrenzter Daten bei dieser Aufgabe ein und schlagen dann Methoden zur Datenanreicherung mit Hilfe von sprach{\"u}bergreifenden Daten vor. Zuletzt schlagen wir mehrere Methoden vor, um die Argumente aus verschiedenen Aspekten besser kodieren zu k{\"o}nnen. Die meisten der existierenden Methoden des maschinellen Lernens werden auf den Abschnitten 2-21 der PDTB trainiert und auf dem Abschnitt 23 getestet, der insgesamt nur weniger als 800 implizite Diskursrelationsinstanzen enth{\"a}lt. Mit Hilfe der Kreuzvalidierung argumentieren wir, dass der Standardtestausschnitt der PDTB zu klein ist um daraus Schlussfolgerungen zu ziehen. Mit mehr Teststichproben in der Kreuzvalidierung w{\"u}rden wir zu anderen Schlussfolgerungen dar{\"u}ber kommen, ob ein Merkmal f{\"u}r diese Aufgabe generell vorteilhaft ist oder nicht, insbesondere wenn wir einen relativ gro{\ss}en Labelsatz verwenden. Wenn wir nur unseren kleinen Standardtestsatz herausstellen, laufen wir Gefahr, falsche Schl{\"u}sse dar{\"u}ber zu ziehen, welche Merkmale hilfreich sind. Zweitens schlagen wir einen einfachen Ansatz zur automatischen Extraktion von Samples impliziter Diskursrelationen aus mehrsprachigen Parallelkorpora durch R{\"u}ck{\"u}bersetzung vor. Er ist durch den Explikationsprozess motiviert, wenn Menschen einen Text {\"u}bersetzen. Nach der R{\"u}ck{\"u}bersetzung aus den Zielsprachen ist es f{\"u}r den Diskursparser leicht, diejenigen Beispiele zu identifizieren, die urspr{\"u}nglich implizit, in den R{\"u}ck{\"u}bersetzungen aber explizit enthalten sind. Da diese zus{\"a}tzlichen Daten im Trainingsset enthalten sind, zeigen die Experimente signifikante Verbesserungen in verschiedenen Situationen. Wir verwenden zun{\"a}chst nur franz{\"o}sisch-englische Paare und haben keine Kontrolle {\"u}ber die Qualit{\"a}t und konzentrieren uns meist auf die satzinternen Relationen. Um diese Fragen in Angriff zu nehmen, erweitern wir die Idee sp{\"a}ter mit mehr Vorverarbeitungsschritten und mehr Sprachpaaren. Mit den Mehrheitsentscheidungen aus verschiedenen Sprachpaaren sind die gemappten impliziten Labels zuverl{\"a}ssiger. Schlie{\ss}lich ist auch eine bessere Kodierf{\"a}higkeit von entscheidender Bedeutung f{\"u}r die Verbesserung der Klassifizierungsleistung. 
Wir schlagen ein neues Modell vor, das aus einem Klassifikator und einem Sequenz-zu-Sequenz-Modell besteht. Neben der korrekten Vorhersage des Labels werden sie auch darauf trainiert, eine Repr{\"a}sentation der Diskursrelationsargumente zu erzeugen, indem sie versuchen, die Argumente einschlie{\ss}lich eines geeigneten impliziten Konnektivs vorherzusagen. Die neuartige sekund{\"a}re Aufgabe zwingt die interne Repr{\"a}sentation dazu, die Semantik der Relationsargumente vollst{\"a}ndiger zu kodieren und eine feink{\"o}rnigere Klassifikation vorzunehmen. Um das allgemeine Wissen in Kontexten weiter zu erfassen, setzen wir auch ein Ged{\"a}chtnisnetzwerk ein, um eine explizite Kontextrepr{\"a}sentation von Trainingsbeispielen f{\"u}r Kontexte zu erhalten. F{\"u}r jede Testinstanz erzeugen wir durch gewichtetes Lesen des Ged{\"a}chtnisses einen Wissensvektor. Wir evaluieren das vorgeschlagene Modell unter verschiedenen Bedingungen und die Ergebnisse zeigen, dass das Modell mit dem Speichernetzwerk die Vorhersage von Diskursrelationen erleichtern kann, indem es Beispiele ausw{\"a}hlt, die eine {\"a}hnliche semantische Repr{\"a}sentation und Diskursrelationen aufweisen. Auch wenn ein besseres Verst{\"a}ndnis, eine Kodierung und semantische Interpretation f{\"u}r die Aufgabe der impliziten Diskursrelationsklassifikation unerl{\"a}sslich und n{\"u}tzlich sind, so leistet sie doch nur einen Teil der Arbeit. Ein guter impliziter Diskursrelationsklassifikator sollte sich auch der bevorstehenden Ereignisse, Ursachen, Folgen usw. bewusst sein, um die Diskurserwartung in die Satzdarstellungen zu kodieren. Mit Hilfe des k{\"u}rzlich vorgeschlagenen BERT-Modells versuchen wir herauszufinden, ob es f{\"u}r die Aufgabe vorteilhaft ist, den richtigen n{\"a}chsten Satz zu haben oder nicht. Die experimentellen Ergebnisse zeigen, dass das Entfernen der Aufgabe zur Vorhersage des n{\"a}chsten Satzes die Leistung sowohl innerhalb der Dom{\"a}ne als auch dom{\"a}nen{\"u}bergreifend stark beeintr{\"a}chtigt. Die begrenzte F{\"a}higkeit von BioBERT, dom{\"a}nenspezifisches Wissen, d.h. Entit{\"a}tsinformationen, Entit{\"a}tsbeziehungen etc. zu erlernen, motiviert uns, externes Wissen in die vortrainierten Sprachmodelle zu integrieren. Wir schlagen eine un{\"u}berwachte Methode vor, bei der Information-Retrieval-System und Wissensgraphen-Techniken verwendet werden, mit der Annahme, dass, wenn zwei Instanzen {\"a}hnliche Entit{\"a}ten in beiden relationalen Argumenten teilen, die Wahrscheinlichkeit gro{\ss} ist, dass sie die gleiche oder eine {\"a}hnliche Diskursrelation haben. Der Ansatz erzielt vergleichbare Ergebnisse auf BioDRB, verglichen mit Baselinemodellen. Anschlie{\ss}end verwenden wir die extrahierten relevanten Entit{\"a}ten zur Verbesserung des vortrainierten Modells K-BERT, um die Bedeutung der Argumente besser zu kodieren und das urspr{\"u}ngliche BERT und BioBERT mit einer Genauigkeit von 6,5% bzw. 2% zu {\"u}bertreffen. Zusammenfassend tr{\"a}gt diese Dissertation dazu bei, das Problem des Datenengpasses bei der impliziten Diskursrelationsklassifikation anzugehen, und schl{\"a}gt entsprechende Ans{\"a}tze in verschiedenen Aspekten vor, u.a. 
die Darstellung des begrenzten Datenproblems und der Risiken bei der Schlussfolgerung daraus; die Erfassung automatisch annotierter Daten durch den Explikationsprozess w{\"a}hrend der manuellen {\"U}bersetzung zwischen Englisch und anderen Sprachen; eine bessere Repr{\"a}sentation von Diskursrelationsargumenten; Entity-Enhancement mit einer un{\"u}berwachten Methode und einem vortrainierten Sprachmodell.},
pubstate = {published},
type = {phdthesis}
}


Project:   B2

Mosbach, Marius; Degaetano-Ortlieb, Stefania; Krielke, Marie-Pauline; Abdullah, Badr M.; Klakow, Dietrich

A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English Inproceedings

Proceedings of the 28th International Conference on Computational Linguistics, pp. 771-787, 2020.

Transformer-based language models achieve high performance on various tasks, but we still lack an understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge through sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon whose resolution requires contextual information and antecedent identification. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks that target fine-grained linguistic knowledge, however, reveals pronounced model-specific weaknesses, especially in semantic knowledge, which strongly impact the models' performance. Our results highlight the importance of (a) comparing models across evaluation tasks and (b) grounding claims about model performance and the linguistic knowledge models capture in more than purely probing-based evaluations.
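A masked prediction probe of this kind can be reproduced with off-the-shelf tools; the sketch below uses the Hugging Face fill-mask pipeline on a toy sentence of our own, not an item from the paper's diagnostic set:

```python
# Minimal fill-mask probe in the spirit of the masked prediction tasks.
# Requires: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Can the model recover the relativizer introducing the relative clause?
for pred in fill_mask("The author [MASK] wrote the book was famous."):
    print(f"{pred['token_str']:>8}  {pred['score']:.3f}")
```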

@inproceedings{Mosbach2020,
title = {A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English},
author = {Marius Mosbach and Stefania Degaetano-Ortlieb and Marie-Pauline Krielke and Badr M. Abdullah and Dietrich Klakow},
url = {https://aclanthology.org/2020.coling-main.67/},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
pages = {771-787},
abstract = {Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a) model comparison in evaluation task and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.},
pubstate = {published},
type = {inproceedings}
}


Projects:   B1 B4 C4

Juzek, Tom; Krielke, Marie-Pauline; Teich, Elke

Exploring diachronic syntactic shifts with dependency length: the case of scientific English Inproceedings

Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), Association for Computational Linguistics, pp. 109-119, Barcelona, Spain (Online), 2020.

We report on an application of universal dependencies for the study of diachronic shifts in syntactic usage patterns. Our focus is on the evolution of Scientific English in the Late Modern English period (ca. 1700-1900). Our data set is the Royal Society Corpus (RSC), comprising the full set of publications of the Royal Society of London between 1665 and 1996. Our starting assumption is that over time, Scientific English develops specific syntactic choice preferences that increase efficiency in (expert-to-expert) communication. The specific hypothesis we pursue in this paper is that changing syntactic choice preferences lead to greater dependency locality/dependency length minimization, which is associated with positive effects for the efficiency of human as well as computational linguistic processing. As a basis for our measurements, we parsed the RSC using Stanford CoreNLP. Overall, we observe a decrease in dependency length, with long dependency structures becoming less frequent and short dependency structures becoming more frequent over time, notably pertaining to the nominal phrase, thus marking an overall push towards greater communicative efficiency.
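Dependency length here is the linear distance between a token and its head, summed over a sentence. A minimal sketch over standard CoNLL-U output (such as a CoreNLP parse converted to CoNLL-U); skipping punctuation and the root is a common convention assumed here, not taken from the paper:

```python
# Total dependency length of one sentence given as a CoNLL-U string.
def total_dependency_length(conllu_sentence: str) -> int:
    total = 0
    for line in conllu_sentence.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, head, deprel = cols[0], cols[6], cols[7]
        if "-" in tok_id or "." in tok_id:    # skip multiword/empty tokens
            continue
        if head == "0" or deprel == "punct":  # skip root and punctuation
            continue
        total += abs(int(tok_id) - int(head))
    return total
```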

@inproceedings{juzek-etal-2020-exploring,
title = {Exploring diachronic syntactic shifts with dependency length: the case of scientific English},
author = {Tom Juzek and Marie-Pauline Krielke and Elke Teich},
url = {https://www.aclweb.org/anthology/2020.udw-1.13},
year = {2020},
date = {2020},
booktitle = {Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)},
pages = {109-119},
publisher = {Association for Computational Linguistics},
address = {Barcelona, Spain (Online)},
abstract = {We report on an application of universal dependencies for the study of diachronic shifts in syntactic usage patterns. Our focus is on the evolution of Scientific English in the Late Modern English period (ca. 1700-1900). Our data set is the Royal Society Corpus (RSC), comprising the full set of publications of the Royal Society of London between 1665 and 1996. Our starting assumption is that over time, Scientific English develops specific syntactic choice preferences that increase efficiency in (expert-to-expert) communication. The specific hypothesis we pursue in this paper is that changing syntactic choice preferences lead to greater dependency locality/dependency length minimization, which is associated with positive effects for the efficiency of human as well as computational linguistic processing. As a basis for our measurements, we parsed the RSC using Stanford CoreNLP. Overall, we observe a decrease in dependency length, with long dependency structures becoming less frequent and short dependency structures becoming more frequent over time, notably pertaining to the nominal phrase, thus marking an overall push towards greater communicative efficiency.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Teich, Elke

Language variation and change: A communicative perspective Miscellaneous

Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, DGfS 2020, Hamburg, 2020.

It is widely acknowledged that language use and language structure are closely interlinked, linguistic structure emerging from language use (Bybee & Hopper 2001). Language use, in turn, is characterized by variation; in fact, speakers’ ability to adapt to changing contexts is a prerequisite for language to be functional (Weinreich et al. 1968).

Taking the perspective of rational communication, in my talk I will revisit some core questions of diachronic linguistic change: Why does a change happen? Which features are involved in change? How does change proceed? What are the effects of change? Recent work on online human language use reveals that speakers try to optimize their linguistic productions by encoding their messages with uniform information density (see Crocker et al. 2016 for an overview). Here, a major determinant in linguistic choice is predictability in context. Predictability in context is commonly represented by information content measured in bits (Shannon information): The more predictable a linguistic unit (e.g. word) is in a given context, the fewer bits are needed for encoding and the shorter its linguistic encoding may be (and vice versa, the more “surprising” a unit is in a given context, the more bits are needed for encoding and the more explicit its encoding tends to be). In this view, one major function of linguistic variation is to modulate information content so as to optimize message transmission.

In my talk, I apply this perspective to diachronic linguistic change. I show that speakers’ continuous adaptation to changing contextual conditions pushes towards linguistic innovation and results in temporary, high levels of expressivity, but the concern for maintaining communicative function pulls towards convergence and results in conventionalization. The diachronic scenario I discuss is mid-term change (200–250 years) in English in the late Modern period, focusing on the discourse domain of science (Degaetano-Ortlieb & Teich 2019). In terms of methods, I use computational language models to estimate predictability in context; and to assess diachronic change, I apply selected measures of information content, including entropy and surprisal.
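For illustration, surprisal as defined above (the negative base-2 log of a unit's in-context probability) can be estimated with even a toy bigram model; the counts and smoothing below are placeholders, not the computational language models used in the talk:

```python
# Toy bigram estimate of surprisal in bits, with add-one smoothing.
import math
from collections import Counter

def surprisal(word, prev, bigrams, unigrams, vocab_size):
    """-log2 P(word | prev) with add-one smoothing."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

tokens = "the cell and the cell wall of the membrane".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
print(surprisal("cell", "the", bigrams, unigrams, len(unigrams)))
```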

@miscellaneous{Teich2020a,
title = {Language variation and change: A communicative perspective},
author = {Elke Teich},
url = {https://www.zfs.uni-hamburg.de/en/dgfs2020/programm/keynotes/elke-teich.html},
year = {2020},
date = {2020-11-04},
booktitle = {Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft, DGfS 2020},
address = {Hamburg},
abstract = {It is widely acknowledged that language use and language structure are closely interlinked, linguistic structure emerging from language use (Bybee & Hopper 2001). Language use, in turn, is characterized by variation; in fact, speakers’ ability to adapt to changing contexts is a prerequisite for language to be functional (Weinreich et al. 1968). Taking the perspective of rational communication, in my talk I will revisit some core questions of diachronic linguistic change: Why does a change happen? Which features are involved in change? How does change proceed? What are the effects of change? Recent work on online human language use reveals that speakers try to optimize their linguistic productions by encoding their messages with uniform information density (see Crocker et al. 2016 for an overview). Here, a major determinant in linguistic choice is predictability in context. Predictability in context is commonly represented by information content measured in bits (Shannon information): The more predictable a linguistic unit (e.g. word) is in a given context, the fewer bits are needed for encoding and the shorter its linguistic encoding may be (and vice versa, the more “surprising” a unit is in a given context, the more bits are needed for encoding and the more explicit its encoding tends to be). In this view, one major function of linguistic variation is to modulate information content so as to optimize message transmission. In my talk, I apply this perspective to diachronic linguistic change. I show that speakers’ continuous adaptation to changing contextual conditions pushes towards linguistic innovation and results in temporary, high levels of expressivity, but the concern for maintaining communicative function pulls towards convergence and results in conventionalization. The diachronic scenario I discuss is mid-term change (200–250 years) in English in the late Modern period, focusing on the discourse domain of science (Degaetano-Ortlieb & Teich 2019). In terms of methods, I use computational language models to estimate predictability in context; and to assess diachronic change, I apply selected measures of information content, including entropy and surprisal.},
note = {Key note},
pubstate = {published},
type = {miscellaneous}
}


Project:   B1

Ortmann, Katrin; Dipper, Stefanie

Automatic Orality Identification in Historical Texts Inproceedings

Proceedings of The 12th Language Resources and Evaluation Conference (LREC), European Language Resources Association, pp. 1293-1302, Marseille, France, 2020.

Independently of the medial representation (written/spoken), language can exhibit characteristics of conceptual orality or literacy, which mainly manifest themselves on the lexical or syntactic level. In this paper we aim at automatically identifying conceptually-oral historical texts, with the ultimate goal of gaining knowledge about spoken data of historical time stages.

We apply a set of general linguistic features that have been proven to be effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be equally useful in determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features like sentence length, particles or interjections point to peculiarities of the historical data and reveal problems with the adoption of a feature set that was developed on modern language data.
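As an illustration of such a feature set, the sketch below computes a few of the mentioned cues (pronoun frequency, verb-noun ratio, interjections) from POS-tagged tokens; the STTS-style tag names and the exact feature list are our assumptions, not the paper's implementation:

```python
# Illustrative conceptual-orality features from POS-tagged tokens.
# STTS tags assumed: PPER = personal pronoun, ITJ = interjection,
# NN/NE = common/proper noun, V* = verbal tags.
def orality_features(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs for one text."""
    n = len(tagged_tokens) or 1
    pos_tags = [pos for _, pos in tagged_tokens]
    verbs = sum(pos.startswith("V") for pos in pos_tags)
    nouns = sum(pos in ("NN", "NE") for pos in pos_tags)
    return {
        "pers_pron_ratio": sum(pos == "PPER" for pos in pos_tags) / n,
        "interjection_ratio": sum(pos == "ITJ" for pos in pos_tags) / n,
        "verb_noun_ratio": verbs / max(nouns, 1),
        "mean_word_length": sum(len(w) for w, _ in tagged_tokens) / n,
    }
```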

@inproceedings{Ortmann2020,
title = {Automatic Orality Identification in Historical Texts},
author = {Katrin Ortmann and Stefanie Dipper},
url = {https://www.aclweb.org/anthology/2020.lrec-1.162/},
year = {2020},
date = {2020},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC)},
pages = {1293-1302},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {Independently of the medial representation (written/spoken), language can exhibit characteristics of conceptual orality or literacy, which mainly manifest themselves on the lexical or syntactic level. In this paper we aim at automatically identifying conceptually-oral historical texts, with the ultimate goal of gaining knowledge about spoken data of historical time stages. We apply a set of general linguistic features that have been proven to be effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be equally useful in determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features like sentence length, particles or interjections point to peculiarities of the historical data and reveal problems with the adoption of a feature set that was developed on modern language data.},
pubstate = {published},
type = {inproceedings}
}


Project:   C6

Stenger, Irina; Avgustinova, Tania

How intelligible is spoken Bulgarian for Russian native speakers in an intercomprehension scenario? Inproceedings

Micheva, Vanya et al. (Ed.): Proceedings of the International Annual Conference of the Institute for Bulgarian Language, 2, pp. 142-151, Sofia, Bulgaria, 2020.

In a web-based experiment, Bulgarian audio stimuli in the form of recorded isolated words are presented to Russian native speakers who are required to write a suitable Russian translation. The degree of intelligibility, as revealed by the cognate guessing task, is relatively high for this pair of languages. We correlate the obtained intercomprehension scores with established linguistic factors in order to determine their influence on the cross-linguistic spoken word recognition. A detailed error analysis focuses on sound correspondences that cause translation problems in such an intercomprehension scenario.
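The correlation step described here is standard; a minimal sketch with invented item-level numbers (not the study's data):

```python
# Toy illustration of correlating intelligibility with a linguistic factor.
from scipy.stats import pearsonr

# Share of correct translations per Bulgarian stimulus ...
intelligibility = [0.92, 0.81, 0.40, 0.65, 0.23]
# ... against one factor, e.g. a normalized phonetic distance per word pair.
phonetic_distance = [0.05, 0.10, 0.60, 0.30, 0.80]

r, p = pearsonr(intelligibility, phonetic_distance)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```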

@inproceedings{Stenger2020b,
title = {How intelligible is spoken Bulgarian for Russian native speakers in an intercomprehension scenario?},
author = {Irina Stenger and Tania Avgustinova},
editor = {Vanya Micheva et al.},
year = {2020},
date = {2020},
booktitle = {Proceedings of the International Annual Conference of the Institute for Bulgarian Language},
pages = {142-151},
address = {Sofia, Bulgaria},
abstract = {In a web-based experiment, Bulgarian audio stimuli in the form of recorded isolated words are presented to Russian native speakers who are required to write a suitable Russian translation. The degree of intelligibility, as revealed by the cognate guessing task, is relatively high for this pair of languages. We correlate the obtained intercomprehension scores with established linguistic factors in order to determine their influence on the cross-linguistic spoken word recognition. A detailed error analysis focuses on sound correspondences that cause translation problems in such an intercomprehension scenario.},
pubstate = {published},
type = {inproceedings}
}


Project:   C4

Avgustinova, Tania; Stenger, Irina

Russian-Bulgarian mutual intelligibility in light of linguistic and statistical models of Slavic receptive multilingualism [Russko-bolgarskaja vzaimoponjatnost’ v svete lingvističeskich i statističeskich modelej slavjanskoj receptivnoj mnogojazyčnosti] Book Chapter

Marti, Roland; Pognan, Patrice; Schlamberger Brezar, Mojca (Ed.): University Press, Faculty of Arts, pp. 85-99, Ljubljana, Slovenia, 2020.

Computational modelling of the observed mutual intelligibility of Slavic languages unavoidably requires systematic integration of classical Slavistics knowledge from comparative historical grammar and traditional contrastive description of language pairs. The phenomenon of intercomprehension is quite intuitive: speakers of a given language L1 understand another closely related language (variety) L2 without being able to use the latter productively, i.e. for speaking or writing.

This specific mode of using the human linguistic competence manifests itself as receptive multilingualism. The degree of mutual understanding of genetically closely related languages, such as Bulgarian and Russian, corresponds to objectively measurable distances at different linguistic levels. The common Slavic basis and the comparative-synchronous perspective allow us to reveal Bulgarian-Russian linguistic affinity with regard to spelling, vocabulary and grammar.

@inbook{Avgustinova2020,
title = {Russian-Bulgarian mutual intelligibility in light of linguistic and statistical models of Slavic receptive multilingualism [Russko-bolgarskaja vzaimoponjatnost’ v svete lingvisti{\v{c}}eskich i statisti{\v{c}}eskich modelej slavjanskoj receptivnoj mnogojazy{\v{c}}nosti]},
author = {Tania Avgustinova and Irina Stenger},
editor = {Roland Marti and Patrice Pognan and Mojca Schlamberger Brezar},
url = {https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/226/326/5284-1},
year = {2020},
date = {2020},
pages = {85-99},
publisher = {University Press, Faculty of Arts},
address = {Ljubljana, Slovenia},
abstract = {Computational modelling of the observed mutual intelligibility of Slavic languages unavoidably requires systematic integration of classical Slavistics knowledge from comparative historical grammar and traditional contrastive description of language pairs. The phenomenon of intercomprehension is quite intuitive: speakers of a given language L1 understand another closely related language (variety) L2 without being able to use the latter productively, i.e. for speaking or writing. This specific mode of using the human linguistic competence manifests itself as receptive multilingualism. The degree of mutual understanding of genetically closely related languages, such as Bulgarian and Russian, corresponds to objectively measurable distances at different linguistic levels. The common Slavic basis and the comparative-synchronous perspective allow us to reveal Bulgarian-Russian linguistic affinity with regard to spelling, vocabulary and grammar.},
pubstate = {published},
type = {inbook}
}


Project:   C4

Stenger, Irina; Avgustinova, Tania

Visual vs. auditory perception of Bulgarian stimuli by Russian native speakers Inproceedings

Selegej, Vladimir P. et al. (Ed.): Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference ‘Dialogue’, pp. 684-695, 2020.

This study contributes to a better understanding of receptive multilingualism by determining similarities and differences in successful processing of written and spoken cognate words in an unknown but (closely) related language. We investigate two Slavic languages with regard to their mutual intelligibility. The current focus is on the recognition of isolated Bulgarian words by Russian native speakers in a cognate guessing task, considering both written and audio stimuli.

The experimentally obtained intercomprehension scores show a generally high degree of intelligibility of Bulgarian cognates to Russian subjects, as well as processing difficulties in the case of visual vs. auditory perception. In search of an explanation, we examine the linguistic factors that can contribute to various degrees of written and spoken word intelligibility. The intercomprehension scores obtained in the online word translation experiments are correlated with (i) the identical and mismatched correspondences on the orthographic and phonetic level, (ii) the word length of the stimuli, and (iii) the frequency of Russian cognates. Additionally, we validate two measuring methods, the Levenshtein distance and word adaptation surprisal, as potential predictors of word intelligibility in reading and oral intercomprehension.
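Of the two measures validated here, the Levenshtein distance is easy to make concrete; the sketch below implements the usual dynamic-programming version, normalized by the longer word's length (the normalization choice and the transliterated example pair are our assumptions):

```python
# Edit distance between two cognates, and a length-normalized variant.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    return levenshtein(a, b) / max(len(a), len(b), 1)

# Bulgarian stimulus vs. Russian cognate (transliterated for readability):
print(normalized_levenshtein("mljako", "moloko"))
```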

@inproceedings{Stenger2020c,
title = {Visual vs. auditory perception of Bulgarian stimuli by Russian native speakers},
author = {Irina Stenger and Tania Avgustinova},
editor = {Vladimir P. Selegej et al.},
url = {http://www.dialog-21.ru/media/4962/stengeriplusavgustinovat-045.pdf},
year = {2020},
date = {2020},
booktitle = {Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference ‘Dialogue’},
pages = {684-695},
abstract = {This study contributes to a better understanding of receptive multilingualism by determining similarities and differences in successful processing of written and spoken cognate words in an unknown but (closely) related language. We investigate two Slavic languages with regard to their mutual intelligibility. The current focus is on the recognition of isolated Bulgarian words by Russian native speakers in a cognate guessing task, considering both written and audio stimuli. The experimentally obtained intercomprehension scores show a generally high degree of intelligibility of Bulgarian cognates to Russian subjects, as well as processing difficulties in case of visual vs. auditory perception. In search of an explanation, we examine the linguistic factors that can contribute to various degrees of written and spoken word intelligibility. The intercomprehension scores obtained in the online word translation experiments are correlated with (i) the identical and mismatched correspondences on the orthographic and phonetic level, (ii) the word length of the stimuli, and (iii) the frequency of Russian cognates. Additionally we validate two measuring methods: the Levenshtein distance and the word adaptation surprisal as potential predictors of the word intelligibility in reading and oral intercomprehension.},
pubstate = {published},
type = {inproceedings}
}


Project:   C4

Avgustinova, Tania; Jágrová, Klára; Stenger, Irina

The INCOMSLAV Platform: Experimental Website with Integrated Methods for Measuring Linguistic Distances and Asymmetries in Receptive Multilingualism Inproceedings

Fiumara, James; Cieri, Christopher; Liberman, Mark; Callison-Burch, Chris (Ed.): LREC 2020 Workshop Language Resources and Evaluation Conference 11-16 May 2020, Citizen Linguistics in Language Resource Development (CLLRD 2020), European Language Resources Association, Marseille, France, 2020.

We report on a web-based resource for conducting intercomprehension experiments with native speakers of Slavic languages and present our methods for measuring linguistic distances and asymmetries in receptive multilingualism. Through a website which serves as a platform for online testing, a large number of participants with different linguistic backgrounds can be targeted. A statistical language model is used to measure information density and to gauge how language users master various degrees of (un)intelligibility. The key idea is that intercomprehension should be better when the model adapted for understanding the unknown language exhibits relatively low average distance and surprisal. All obtained intelligibility scores, together with distance and asymmetry measures for the different language pairs and processing directions, are made available as an integrated online resource in the form of a Slavic intercomprehension matrix (SlavMatrix).
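The surprisal-based distance idea can be illustrated as word adaptation surprisal over aligned character correspondences; the probabilities below are invented placeholders, not values from the SlavMatrix resource:

```python
# Word adaptation surprisal: sum of -log2 p over aligned character
# correspondences between a stimulus and a native cognate.
import math

def word_adaptation_surprisal(aligned_pairs, corr_prob, floor=1e-6):
    """aligned_pairs: 1:1-aligned (stimulus_char, native_char) tuples."""
    return sum(-math.log2(corr_prob.get(pair, floor)) for pair in aligned_pairs)

corr_prob = {("a", "a"): 0.9, ("e", "o"): 0.25, ("t", "t"): 0.95}
print(word_adaptation_surprisal([("a", "a"), ("e", "o"), ("t", "t")], corr_prob))
```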

@inproceedings{Avgustinova2020b,
title = {The INCOMSLAV Platform: Experimental Website with Integrated Methods for Measuring Linguistic Distances and Asymmetries in Receptive Multilingualism},
author = {Tania Avgustinova and Kl{\'a}ra J{\'a}grov{\'a} and Irina Stenger},
editor = {James Fiumara and Christopher Cieri and Mark Liberman and Chris Callison-Burch},
url = {https://aclanthology.org/2020.cllrd-1.6/},
year = {2020},
date = {2020},
booktitle = {LREC 2020 Workshop Language Resources and Evaluation Conference 11-16 May 2020, Citizen Linguistics in Language Resource Development (CLLRD 2020)},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {We report on a web-based resource for conducting intercomprehension experiments with native speakers of Slavic languages and present our methods for measuring linguistic distances and asymmetries in receptive multilingualism. Through a website which serves as a platform for online testing, a large number of participants with different linguistic backgrounds can be targeted. A statistical language model is used to measure information density and to gauge how language users master various degrees of (un)intelligibility. The key idea is that intercomprehension should be better when the model adapted for understanding the unknown language exhibits relatively low average distance and surprisal. All obtained intelligibility scores together with distance and asymmetry measures for the different language pairs and processing directions are made available as an integrated online resource in the form of a Slavic intercomprehension matrix (SlavMatrix).},
pubstate = {published},
type = {inproceedings}
}


Project:   C4

Stenger, Irina; Jágrová, Klára; Fischer, Andrea; Avgustinova, Tania

“Reading Polish with Czech Eyes” or “How Russian Can a Bulgarian Text Be?”: Orthographic Differences as an Experimental Variable in Slavic Intercomprehension Incollection

Radeva-Bork, Teodora; Kosta, Peter (Ed.): Current Developments in Slavic Linguistics. Twenty Years After (based on selected papers from FDSL 11), Peter Lang, pp. 483-500, 2020.

@incollection{Stenger2020,
title = {“Reading Polish with Czech Eyes” or “How Russian Can a Bulgarian Text Be?”: Orthographic Differences as an Experimental Variable in Slavic Intercomprehension},
author = {Irina Stenger and Kl{\'a}ra J{\'a}grov{\'a} and Andrea Fischer and Tania Avgustinova},
editor = {Teodora Radeva-Bork and Peter Kosta},
url = {https://www.peterlang.com/view/title/19540},
doi = {https://doi.org/10.3726/978-3-653-07147-4},
year = {2020},
date = {2020},
booktitle = {Current Developments in Slavic Linguistics. Twenty Years After (based on selected papers from FDSL 11)},
pages = {483-500},
publisher = {Peter Lang},
pubstate = {published},
type = {incollection}
}


Project:   C4

Jachmann, Torsten

The immediate influence of speaker gaze on situated speech comprehension: evidence from multiple ERP components PhD Thesis

Saarland University, Saarbruecken, Germany, 2020.

This thesis presents results from three ERP experiments on the influence of speaker gaze on listeners' sentence comprehension, with a focus on the utilization of speaker gaze as part of the communicative signal. The first two experiments investigated whether speaker gaze is utilized in situated communication to form expectations about upcoming referents in an unfolding sentence. Participants were presented with a face performing gaze actions toward three surrounding objects, time-aligned to utterances that compared two of the three objects.

Participants were asked to judge whether the sentence they heard was true given the provided scene. Gaze cues preceded the naming of the corresponding object by 800 ms. The gaze cue preceding the mention of the second object was manipulated such that it was either Congruent, Incongruent, or Uninformative (averted toward an empty position in Experiment 1; mutual, i.e., redirected toward the listener, in Experiment 2). The results showed that speaker gaze was used to form expectations about the unfolding sentence, as indicated by three ERP components that index different underlying mechanisms of language comprehension: an increased Phonological Mapping Negativity (PMN) was observed when an unexpected (Incongruent) or unpredictable (Uninformative) phoneme was encountered. The retrieval of a referent's semantics was indexed by an N400 effect in response to referents following both Incongruent and Uninformative gaze. Additionally, an increased P600 response was present only after Incongruent gaze, indexing the revision of the mental representation of the situation. The involvement of these mechanisms is supported by the findings of the third experiment, in which linguistic content served as a predictive cue for subsequent speaker gaze. In this experiment, the sentence structure enabled participants to anticipate upcoming referents based on the preceding linguistic content, so gaze cues preceding the mention of the referent could also be anticipated.

The results showed the involvement of the same mechanisms as in the first two experiments on the referent itself, but only when preceding gaze was absent. In the presence of object-directed gaze, there were no longer significant effects on the referent itself; instead, effects of semantic retrieval (N400) and integration with sentence meaning (P3b) were found on the gaze cue. Effects in the P3b (gaze) and P600 (referent) time windows further support a mechanism that monitors the mental representation of the situation and subsumes integration into that representation: a positive deflection was found whenever the communicative signal completed the mental representation such that an evaluation of that representation became possible. Taken together, the results support the view that speaker gaze in situated communication is interpreted as part of the communicative signal and used incrementally, simultaneously with the linguistic signal, to inform the mental representation of the situation, which in turn is utilized to generate expectations about upcoming referents in an unfolding utterance.

@phdthesis{Jachmann2020,
title = {The immediate influence of speaker gaze on situated speech comprehension: evidence from multiple ERP components},
author = {Torsten Jachmann},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-313090},
doi = {https://doi.org/10.22028/D291-31309},
year = {2020},
date = {2020},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {This thesis presents results from three ERP experiments on the influence of speaker gaze on listeners’ sentence comprehension with focus on the utilization of speaker gaze as part of the communicative signal. The first two experiments investigated whether speaker gaze was utilized in situated communication to form expectations about upcoming referents in an unfolding sentence. Participants were presented with a face performing gaze actions toward three objects surrounding it time aligned to utterances that compared two of the three objects. Participants were asked to judge whether the sentence they heard was true given the provided scene. Gaze cues preceded the naming of the corresponding object by 800ms. The gaze cue preceding the mentioning of the second object was manipulated such that it was either Congruent, Incongruent or Uninformative (Averted toward an empty position in experiment 1 and Mutual (redirected toward the listener) in Experiment 2). The results showed that speaker gaze was used to form expectations about the unfolding sentence indicated by three observed ERP components that index different underlying mechanisms of language comprehension: an increased Phonological Mapping Negativity (PMN) was observed when an unexpected (Incongruent) or unpredictable (Uninformative) phoneme is encountered. The retrieval of a referent’s semantics was indexed by an N400 effect in response to referents following both Incongruent and Uninformative gaze. Additionally, an increased P600 response was present only for preceding Incongruent gaze, indexing the revision process of the mental representation of the situation. The involvement of these mechanisms has been supported by the findings of the third experiment, in which linguistic content was presented to serve as a predictive cue for subsequent speaker gaze. In this experiment the sentence structure enabled participants to anticipate upcoming referents based on the preceding linguistic content. Thus, gaze cues preceding the mentioning of the referent could also be anticipated. The results showed the involvement of the same mechanisms as in the first two experiments on the referent itself, only when preceding gaze was absent. In the presence of object-directed gaze, while there were no longer significant effects on the referent itself, effects of semantic retrieval (N400) and integration with sentence meaning (P3b) were found on the gaze cue. Effects in the P3b (Gaze) and P600 (Referent) time-window further provided support for the presence of a mechanism of monitoring of the mental representation of the situation that subsumes the integration into that representation: A positive deflection was found whenever the communicative signal completed the mental representation such that an evaluation of that representation was possible. Taken together, the results provide support for the view that speaker gaze, in situated communication, is interpreted as part of the communicative signal and incrementally used to inform the mental representation of the situation simultaneously with the linguistic signal and that the mental representation is utilized to generate expectations about upcoming referents in an unfolding utterance.},
pubstate = {published},
type = {phdthesis}
}


Project:   C3

Meier, David; Andreeva, Bistra

Einflussfaktoren auf die Wahrnehmung von Prominenz im natürlichen Dialog Inproceedings

Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz, pp. 257-264, Magdeburg, 2020.

Turnbull et al. [1] observe that several competing factors influence the perception of prosodic prominence in isolated adjective-noun pairs, namely phonology, discourse context, and knowledge about the discourse. The present paper aims to investigate the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting more comparable to natural spoken dialogue. For the study, 144 sentences realized by a single male speaker were spliced together such that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, in line with the findings of Turnbull et al. Moreover, adjectives are consistently rated as more prominent than nouns in comparable contexts. Extending the discourse context and the background information available to participants, however, had only negligible effects in the experimental setup presented here.

@inproceedings{Meier2020,
title = {Einflussfaktoren auf die Wahrnehmung von Prominenz im nat{\"u}rlichen Dialog},
author = {David Meier and Bistra Andreeva},
url = {https://www.essv.de/paper.php?id=465},
year = {2020},
date = {2020},
booktitle = {Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz},
pages = {257-264},
address = {Magdeburg},
abstract = {Turnbull et al. [1] stellen fest, dass sich auf die Wahrnehmung der prosodischen Prominenz von isolierten Adjektiv-Nomen-Paaren mehrere konkurrierende Faktoren auswirken, n{\"a}mlich die Phonologie, der Diskurskontext und das Wissen {\"u}ber den Diskurs. Der vorliegende Beitrag hat das Ziel, den relativen Einfluss der evozierten Fokussierung (eng kontrastiv vs. weit kontrastiv) und der Akzentuierung (akzentuiert vs. nicht akzentuiert) auf die Wahrnehmung von Prominenz zu untersuchen und zu {\"u}berpr{\"u}fen, ob die in Turnbull et al. vorgestellten Konzepte in einer Umgebung reproduzierbar sind, die eher mit einem nat{\"u}rlichsprachlichen Dialog vergleichbar ist. F{\"u}r die Studie wurden 144 realisierte S{\"a}tze eines einzelnen m{\"a}nnlichen Sprechers so zusammengeschnitten, dass ein semantischer Kontrast entweder auf dem betreffenden Nomen oder auf dem Adjektiv entsteht. Die metrisch starken Silben des Adjektivs oder des Nomens waren entweder entsprechend der Fokusstruktur oder gegen Erwartung akzentuiert. Die Ergebnisse zeigen, dass die Akzentuierung einen gr{\"o}{\ss}eren Einfluss auf die Prominenzwahrnehmung als die Fokusbedingung hat, was im Einklang mit den Ergebnissen von Turnbull et al. ist. Adjektive werden zudem konsequent als prominenter eingestuft als Nomen in vergleichbaren Kontexten. Eine Erweiterung des Diskurskontextes und der Hintergrundinformationen, die dem Versuchsteilnehmer zur Verf{\"u}gung standen, haben in dem hier vorgestellten Versuchsaufbau allerdings nur vernachl{\"a}ssigbare Effekte.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Andreeva, Bistra; Möbius, Bernd; Whang, James

Effects of surprisal and boundary strength on phrase-final lengthening Inproceedings

Proc. 10th International Conference on Speech Prosody 2020, pp. 146-150, 2020.

This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength.

ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.
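
In the paper's terms, the information-density measure is the surprisal of a syllable given its two predecessors. As a reading aid, and assuming base-2 logarithms (bits), this can be written as

S(s_i) = -\log_2 P(s_i \mid s_{i-2}, s_{i-1})

with the conditional probabilities estimated from the DeWaC-trained trigram language models mentioned above; high-surprisal syllables are those that are hard to predict from their immediate context.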

@inproceedings{Andreeva2020,
title = {Effects of surprisal and boundary strength on phrase-final lengthening},
author = {Bistra Andreeva and Bernd M{\"o}bius and James Whang},
url = {http://dx.doi.org/10.21437/SpeechProsody.2020-30},
year = {2020},
date = {2020-10-20},
booktitle = {Proc. 10th International Conference on Speech Prosody 2020},
pages = {146-150},
abstract = {This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength. ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Teich, Elke; Martínez Martínez, José; Karakanta, Alina

Translation, information theory and cognition Book Chapter

Alves, Fabio; Lykke Jakobsen, Arnt (Ed.): The Routledge Handbook of Translation and Cognition, Routledge, pp. 360-375, London, UK, 2020, ISBN 9781138037007.

The chapter sketches a formal basis for the probabilistic modelling of human translation on the basis of information theory. We provide a definition of Shannon information applied to linguistic communication and discuss its relevance for modelling translation. We further explain the concept of the noisy channel and provide the link to modelling human translational choice. We suggest that a number of translation-relevant variables, notably (dis)similarity between languages, level of expertise and translation mode (i.e., interpreting vs. translation), may be appropriately indexed by entropy, which in turn has been shown to indicate production effort.
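
For orientation, the standard Shannon definitions behind this sketch can be stated as follows (the chapter's own notation may differ): the information content of a unit x, the entropy of a choice X, and the noisy-channel view of selecting a translation t for a source s are

I(x) = -\log_2 p(x)
H(X) = -\sum_x p(x) \log_2 p(x)
\hat{t} = \arg\max_t P(t)\, P(s \mid t)

Entropy over the set of translation options is the kind of quantity suggested above as an index of production effort.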

@inbook{Teich-etal2020-handbook,
title = {Translation, information theory and cognition},
author = {Elke Teich and Jos{\'e} Mart{\'i}nez Mart{\'i}nez and Alina Karakanta},
editor = {Fabio Alves and Arnt Lykke Jakobsen},
url = {https://www.taylorfrancis.com/chapters/edit/10.4324/9781315178127-24/translation-information-theory-cognition-elke-teich-josé-martínez-martínez-alina-karakanta},
year = {2020},
date = {2020},
booktitle = {The Routledge Handbook of Translation and Cognition},
isbn = {9781138037007},
pages = {360-375},
publisher = {Routledge},
address = {London, UK},
abstract = {The chapter sketches a formal basis for the probabilistic modelling of human translation on the basis of information theory. We provide a definition of Shannon information applied to linguistic communication and discuss its relevance for modelling translation. We further explain the concept of the noisy channel and provide the link to modelling human translational choice. We suggest that a number of translation-relevant variables, notably (dis)similarity between languages, level of expertise and translation mode (i.e., interpreting vs. translation), may be appropriately indexed by entropy, which in turn has been shown to indicate production effort.},
pubstate = {published},
type = {inbook}
}


Project:   B7

Bizzoni, Yuri; Juzek, Tom; España-Bonet, Cristina; Dutta Chowdhury, Koel; van Genabith, Josef; Teich, Elke

How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech Inproceedings

The 17th International Workshop on Spoken Language Translation, Seattle, WA, United States, 2020.

Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).

@inproceedings{Bizzoni2020,
title = {How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech},
author = {Yuri Bizzoni and Tom Juzek and Cristina Espa{\~n}a-Bonet and Koel Dutta Chowdhury and Josef van Genabith and Elke Teich},
url = {https://aclanthology.org/2020.iwslt-1.34/},
doi = {https://doi.org/10.18653/v1/2020.iwslt-1.34},
year = {2020},
date = {2020},
booktitle = {The 17th International Workshop on Spoken Language Translation},
address = {Seattle, WA, United States},
abstract = {Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).},
pubstate = {published},
type = {inproceedings}
}


Projects:   B6 B7

Adelani, David; Hedderich, Michael; Zhu, Dawei; van Berg, Esther; Klakow, Dietrich

Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá Miscellaneous

arXiv:2003.08370, 2020.

The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way.

Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios.

In this work, we perform named entity recognition for Hausa and Yorùbá, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier’s performance.
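
To make the distant-supervision step concrete, below is a minimal, hypothetical Python sketch of gazetteer-based automatic labeling in BIO format. The entity lists, the longest-match rule, and all names are illustrative assumptions; the paper's actual annotation sources and its noise-handling layer are not reproduced here.

# Hypothetical sketch: distant supervision for NER via gazetteer matching.
GAZETTEER = {
    "PER": {"Wole Soyinka", "Aisha Buhari"},  # illustrative entries
    "LOC": {"Lagos", "Ibadan"},
}

def distant_labels(tokens):
    """Assign BIO tags by longest span match against the gazetteer."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for j in range(len(tokens), i, -1):  # try longest spans first
            span = " ".join(tokens[i:j])
            for ent_type, names in GAZETTEER.items():
                if span in names:
                    tags[i] = "B-" + ent_type
                    for k in range(i + 1, j):
                        tags[k] = "I-" + ent_type
                    i, matched = j, True
                    break
            if matched:
                break
        if not matched:
            i += 1
    return tags

print(distant_labels("Wole Soyinka visited Lagos".split()))
# -> ['B-PER', 'I-PER', 'O', 'B-LOC']

Labels obtained this way are inevitably noisy (lists are incomplete and ambiguous), which is exactly why noise-handling methods like those discussed above are integrated on top.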

@miscellaneous{Adelani2020,
title = {Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùb{\'a}},
author = {David Adelani and Michael Hedderich and Dawei Zhu and Esther van Berg and Dietrich Klakow},
url = {https://arxiv.org/abs/2003.08370},
year = {2020},
date = {2020},
abstract = {The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way. Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios. In this work, we perform named entity recognition for Hausa and Yorùb{\'a}, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier's performance.},
pubstate = {published},
type = {miscellaneous}
}


Project:   B4

Lemke, Tyll Robin; Schäfer, Lisa; Drenhaus, Heiner; Reich, Ingo

Script Knowledge Constrains Ellipses in Fragments - Evidence from Production Data and Language Modeling Inproceedings

Proceedings of the Society for Computation in Linguistics, 3, 2020.

We investigate the effect of script-based (Schank and Abelson 1977) extralinguistic context on the omission of words in fragments. Our data elicited with a production task show that predictable words are more often omitted than unpredictable ones, as predicted by the Uniform Information Density (UID) hypothesis (Levy and Jaeger, 2007).

We take into account effects of linguistic and extralinguistic context on predictability and propose a method for estimating the surprisal of words in presence of ellipsis. Our study extends previous evidence for UID in two ways: First, we show that not only local linguistic context, but also extralinguistic context determines the likelihood of omissions. Second, we find UID effects on the omission of content words.
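
The UID logic can be compressed into one hedged equation (a sketch; the paper's actual method for estimating surprisal under ellipsis is more involved): a word w_i is a likelier candidate for omission when its surprisal

S(w_i) = -\log_2 P(w_i \mid w_1 \ldots w_{i-1}, c)

is low, where the conditioning context covers both the preceding words and the extralinguistic script context c.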

@inproceedings{Lemke2020,
title = {Script Knowledge Constrains Ellipses in Fragments - Evidence from Production Data and Language Modeling},
author = {Tyll Robin Lemke and Lisa Sch{\"a}fer and Heiner Drenhaus and Ingo Reich},
url = {https://scholarworks.umass.edu/scil/vol3/iss1/45},
doi = {https://doi.org/10.7275/mpby-zr74},
year = {2020},
date = {2020},
booktitle = {Proceedings of the Society for Computation in Linguistics},
abstract = {We investigate the effect of script-based (Schank and Abelson 1977) extralinguistic context on the omission of words in fragments. Our data elicited with a production task show that predictable words are more often omitted than unpredictable ones, as predicted by the Uniform Information Density (UID) hypothesis (Levy and Jaeger, 2007). We take into account effects of linguistic and extralinguistic context on predictability and propose a method for estimating the surprisal of words in presence of ellipsis. Our study extends previous evidence for UID in two ways: First, we show that not only local linguistic context, but also extralinguistic context determines the likelihood of omissions. Second, we find UID effects on the omission of content words.},
pubstate = {published},
type = {inproceedings}
}


Project:   B3

Fischer, Stefan; Knappen, Jörg; Menzel, Katrin; Teich, Elke

The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study Inproceedings

Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, pp. 794-802, Marseille, France, 2020.

We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings.

The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases.

We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.

@inproceedings{fischer-EtAl:2020:LREC,
title = {The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
author = {Stefan Fischer and J{\"o}rg Knappen and Katrin Menzel and Elke Teich},
url = {https://www.aclweb.org/anthology/2020.lrec-1.99/},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
pages = {794-802},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings. The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases. We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Fankhauser, Peter; Teich, Elke

Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach Journal Article

Jurgens, David (Ed.): Frontiers in Artificial Intelligence, section Language and Computation, 2020.

We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665.

Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register).

Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.
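
The relative entropy mentioned here is, in its standard form, the Kullback-Leibler divergence between two language models p and q, for instance models trained on two time periods or registers:

D(p \| q) = \sum_w p(w) \log_2 \frac{p(w)}{q(w)}

Its asymmetry is informative: D(p || q) measures the extra bits needed to encode data drawn from p with a model of q, so directionality across time or register is itself a signal.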

@article{Bizzoni2020b,
title = {Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Peter Fankhauser and Elke Teich},
editor = {David Jurgens},
url = {https://www.frontiersin.org/articles/10.3389/frai.2020.00073/full},
doi = {https://doi.org/10.3389/frai.2020.00073},
year = {2020},
date = {2020-10-18},
journal = {Frontiers in Artificial Intelligence, section Language and Computation},
abstract = {We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.},
pubstate = {published},
type = {article}
}


Project:   B1

Wichlacz, Julia; Höller, Daniel; Torralba, Álvaro; Hoffmann, Jörg

Applying Monte-Carlo Tree Search in HTN Planning Inproceedings

Proceedings of the 13th International Symposium on Combinatorial Search (SoCS), AAAI Press, pp. 82-90, Vienna, Austria, 2020.

Search methods are useful in hierarchical task network (HTN) planning to make performance less dependent on the domain knowledge provided, and to minimize plan costs. Here we investigate Monte-Carlo tree search (MCTS) as a new algorithmic alternative in HTN planning. We implement combinations of MCTS with heuristic search in Panda. We furthermore investigate MCTS in JSHOP, to address lifted (non-grounded) planning, leveraging the fact that, in contrast to other search methods, MCTS does not require a grounded task representation. Our new methods yield coverage performance on par with the state of the art, but in addition can effectively minimize plan cost over time.
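
For orientation, the selection step at the heart of MCTS is commonly the UCB1 rule. The Python sketch below shows that generic rule only; the paper's integration with Panda and JSHOP and its combination with heuristic search are not reproduced, and all names are illustrative.

import math
from dataclasses import dataclass

@dataclass
class Node:
    visits: int = 0            # times this child was simulated through
    total_reward: float = 0.0  # accumulated rollout reward

def uct_select(children, c=1.4):
    """Return the child maximizing the UCB1 score (exploit + explore)."""
    parent_visits = sum(ch.visits for ch in children)

    def ucb1(ch):
        if ch.visits == 0:
            return float("inf")  # expand unvisited children first
        exploitation = ch.total_reward / ch.visits
        exploration = c * math.sqrt(math.log(parent_visits) / ch.visits)
        return exploitation + exploration

    return max(children, key=ucb1)

print(uct_select([Node(10, 7.0), Node(3, 2.5), Node(0, 0.0)]))
# -> the unvisited node is tried first

In an HTN setting, the children would correspond to the applicable decompositions of the current task network. This is what makes MCTS attractive for lifted planning: the rule needs only visit counts and rewards, not a grounded task representation.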

@inproceedings{Wichlacz20MCTSSOCS,
title = {Applying Monte-Carlo Tree Search in HTN Planning},
author = {Julia Wichlacz and Daniel H{\"o}ller and {\'A}lvaro Torralba and J{\"o}rg Hoffmann},
url = {https://ojs.aaai.org/index.php/SOCS/article/view/18538},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 13th International Symposium on Combinatorial Search (SoCS)},
pages = {82-90},
publisher = {AAAI Press},
address = {Vienna, Austria},
abstract = {Search methods are useful in hierarchical task network (HTN) planning to make performance less dependent on the domain knowledge provided, and to minimize plan costs. Here we investigate Monte-Carlo tree search (MCTS) as a new algorithmic alternative in HTN planning. We implement combinations of MCTS with heuristic search in Panda. We furthermore investigate MCTS in JSHOP, to address lifted (non-grounded) planning, leveraging the fact that, in contrast to other search methods, MCTS does not require a grounded task representation. Our new methods yield coverage performance on par with the state of the art, but in addition can effectively minimize plan cost over time.},
pubstate = {published},
type = {inproceedings}
}


Project:   A7
