Publications

Bizzoni, Yuri; Mosbach, Marius; Klakow, Dietrich; Degaetano-Ortlieb, Stefania

Some steps towards the generation of diachronic WordNets Inproceedings

Proceedings of the 22nd Nordic Conference on Computational Linguistics, Linköping University Electronic Press, Turku, Finland, 2019.

We apply hyperbolic embeddings to trace the dynamics of change of conceptual-semantic relationships in a large diachronic scientific corpus (200 years). Our focus is on emerging scientific fields and the increasingly specialized terminology establishing around them. Reproducing high-quality hierarchical structures such as WordNet on a diachronic scale is a very difficult task.

Hyperbolic embeddings can map partial graphs into low dimensional, continuous hierarchical spaces, making more explicit the latent structure of the input. We show that starting from simple lists of word pairs (rather than a list of entities with directional links) it is possible to build diachronic hierarchical semantic spaces which allow us to model a process towards specialization for selected scientific fields.
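
As a concrete illustration of the pair-based setup described above, here is a minimal sketch using the Poincaré embedding implementation in gensim; the word pairs, dimensionality and training parameters are invented for exposition and are not the authors' configuration:

from gensim.models.poincare import PoincareModel

# Hypothetical (narrower term, broader term) pairs for one time slice;
# the paper derives such lists from a 200-year scientific corpus.
relations = [
    ("galvanism", "electricity"),
    ("electricity", "physics"),
    ("physics", "science"),
    ("chemistry", "science"),
]

# Embed the pairs in a low-dimensional hyperbolic (Poincare) space,
# which makes the latent hierarchy of the input more explicit.
model = PoincareModel(relations, size=2, negative=2)
model.train(epochs=50)

# General terms end up near the origin, specialized terms near the rim.
print(model.kv["science"])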

@inproceedings{bizzoni-etal-2019-steps,
title = {Some steps towards the generation of diachronic WordNets},
author = {Yuri Bizzoni and Marius Mosbach and Dietrich Klakow and Stefania Degaetano-Ortlieb},
url = {https://www.aclweb.org/anthology/W19-6106},
year = {2019},
date = {2019-10-02},
booktitle = {Proceedings of the 22nd Nordic Conference on Computational Linguistics},
publisher = {Link{\"o}ping University Electronic Press},
address = {Turku, Finland},
abstract = {We apply hyperbolic embeddings to trace the dynamics of change of conceptual-semantic relationships in a large diachronic scientific corpus (200 years). Our focus is on emerging scientific fields and the increasingly specialized terminology establishing around them. Reproducing high-quality hierarchical structures such as WordNet on a diachronic scale is a very difficult task. Hyperbolic embeddings can map partial graphs into low dimensional, continuous hierarchical spaces, making more explicit the latent structure of the input. We show that starting from simple lists of word pairs (rather than a list of entities with directional links) it is possible to build diachronic hierarchical semantic spaces which allow us to model a process towards specialization for selected scientific fields.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Mosbach, Marius; Stenger, Irina; Avgustinova, Tania; Klakow, Dietrich

incom.py - A Toolbox for Calculating Linguistic Distances and Asymmetries between Related Languages Inproceedings

Angelova, Galia; Mitkov, Ruslan; Nikolova, Ivelina; Temnikova, Irina (Ed.): Proceedings of Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, 2-4 September 2019, pp. 811-819, Varna, Bulgaria, 2019.

Languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we introduce incom.py, a toolbox for calculating linguistic distances and asymmetries between related languages. incom.py allows linguist experts to quickly and easily perform statistical analyses and compare those with experimental results. We demonstrate the efficacy of incom.py in an intercomprehension experiment on two Slavic languages: Bulgarian and Russian. Using incom.py we were able to validate three methods to measure linguistic distances and asymmetries: Levenshtein distance, word adaptation surprisal, and conditional entropy as predictors of success in a reading intercomprehension experiment.
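
Of the three measures, Levenshtein distance is the simplest to illustrate; the following sketch computes a length-normalized edit distance for a transliterated cognate pair and is independent of the actual incom.py implementation:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    # Length-normalized distance in [0, 1], as commonly used
    # for measuring distances between related languages.
    return levenshtein(a, b) / max(len(a), len(b))

# Hypothetical Bulgarian/Russian cognate pair, transliterated.
print(normalized_levenshtein("mlyako", "moloko"))  # 0.5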

@inproceedings{Mosbach2019,
title = {incom.py - A Toolbox for Calculating Linguistic Distances and Asymmetries between Related Languages},
author = {Marius Mosbach and Irina Stenger and Tania Avgustinova and Dietrich Klakow},
editor = {Galia Angelova and Ruslan Mitkov and Ivelina Nikolova and Irina Temnikova},
url = {https://aclanthology.org/R19-1094/},
doi = {https://doi.org/10.26615/978-954-452-056-4_094},
year = {2019},
date = {2019},
booktitle = {Proceedings of Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, 2-4 September 2019},
pages = {811-819},
address = {Varna, Bulgaria},
abstract = {Languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we introduce incom.py, a toolbox for calculating linguistic distances and asymmetries between related languages. incom.py allows linguist experts to quickly and easily perform statistical analyses and compare those with experimental results. We demonstrate the efficacy of incom.py in an intercomprehension experiment on two Slavic languages: Bulgarian and Russian. Using incom.py we were able to validate three methods to measure linguistic distances and asymmetries: Levenshtein distance, word adaptation surprisal, and conditional entropy as predictors of success in a reading intercomprehension experiment.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B4 C4

Avgustinova, Tania; Iomdin, Leonid

Towards a Typology of Microsyntactic Constructions Inproceedings

Corpas-Pastor, Gloria; Mitkov, Ruslan (Ed.): Computational and Corpus-Based Phraseology, Springer, Cham, pp. 15-30, 2019.

This contribution outlines an international research effort for creating a typology of syntactic idioms on the borderline of the dictionary and the grammar. Recent studies focusing on the adequate description of such units, especially for modern Russian, have resulted in two types of linguistic resources: a microsyntactic dictionary of Russian, and a microsyntactically annotated corpus of Russian texts. Our goal now is to discover to what extent the findings can be generalized cross-linguistically in order to create analogous multilingual resources. The initial work consists in constructing a typology of relevant phenomena. The empirical base is provided by closely related languages which are mutually intelligible to various degrees. We start by creating an inventory for this typology for four representative Slavic languages: Russian (East Slavic), Bulgarian (South Slavic), Polish and Czech (West Slavic). Our preliminary results show that the aim is attainable and can be of relevance to theoretical, comparative and applied linguistics as well as in NLP tasks.

@inproceedings{Avgustinova2019,
title = {Towards a Typology of Microsyntactic Constructions},
author = {Tania Avgustinova and Leonid Iomdin},
editor = {Gloria Corpas-Pastor and Ruslan Mitkov},
url = {https://link.springer.com/chapter/10.1007/978-3-030-30135-4_2},
year = {2019},
date = {2019-09-18},
booktitle = {Computational and Corpus-Based Phraseology},
pages = {15-30},
publisher = {Springer, Cham},
abstract = {This contribution outlines an international research effort for creating a typology of syntactic idioms on the borderline of the dictionary and the grammar. Recent studies focusing on the adequate description of such units, especially for modern Russian, have resulted in two types of linguistic resources: a microsyntactic dictionary of Russian, and a microsyntactically annotated corpus of Russian texts. Our goal now is to discover to what extent the findings can be generalized cross-linguistically in order to create analogous multilingual resources. The initial work consists in constructing a typology of relevant phenomena. The empirical base is provided by closely related languages which are mutually intelligible to various degrees. We start by creating an inventory for this typology for four representative Slavic languages: Russian (East Slavic), Bulgarian (South Slavic), Polish and Czech (West Slavic). Our preliminary results show that the aim is attainable and can be of relevance to theoretical, comparative and applied linguistics as well as in NLP tasks.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Bizzoni, Yuri; Teich, Elke

Analyzing variation in translation through neural semantic spaces Inproceedings

Special topic: Neural Networks for Building and Using Comparable Corpora, Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria, 2019.

We present an approach for exploring the lexical choice patterns in translation on the basis of word embeddings. Specifically, we are interested in variation in translation according to translation mode, i.e. (written) translation vs. (simultaneous) interpreting. While it might seem obvious that the outputs of the two translation modes differ, there are hardly any accounts of the summative linguistic effects of one vs. the other. To explore such effects at the lexical level, we propose a data-driven approach: using neural word embeddings (Word2Vec), we compare the bilingual semantic spaces emanating from source-to-translation and source-to-interpreting.
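
To make the method concrete, here is a minimal gensim Word2Vec sketch; the toy sentences and hyperparameters are placeholders, not the configuration used in the paper:

from gensim.models import Word2Vec

# Toy stand-ins for the training data; in the paper the spaces are
# built from source-to-translation and source-to-interpreting material.
sentences = [
    ["the", "committee", "adopted", "the", "report"],
    ["the", "parliament", "approved", "the", "proposal"],
]

# Train a small embedding space (parameters are illustrative).
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=100)

# Lexical choice patterns can then be explored via nearest neighbours.
print(model.wv.most_similar("report", topn=3))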

@inproceedings{Bizzoni2019,
title = {Analyzing variation in translation through neural semantic spaces},
author = {Yuri Bizzoni and Elke Teich},
url = {https://comparable.limsi.fr/bucc2019/Bizzoni_BUCC2019_paper1.pdf},
year = {2019},
date = {2019-08-30},
booktitle = {Special topic: Neural Networks for Building and Using Comparable Corpora, Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria},
address = {Varna, Bulgaria},
abstract = {We present an approach for exploring the lexical choice patterns in translation on the basis of word embeddings. Specifically, we are interested in variation in translation according to translation mode, i.e. (written) translation vs. (simultaneous) interpreting. While it might seem obvious that the outputs of the two translation modes differ, there are hardly any accounts of the summative linguistic effects of one vs. the other. To explore such effects at the lexical level, we propose a data-driven approach: using neural word embeddings (Word2Vec), we compare the bilingual semantic spaces emanating from source-to-translation and source-to-interpreting.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Yung, Frances Pik Yu; Scholman, Merel; Demberg, Vera

Crowdsourcing Discourse Relation Annotations by a Two-Step Connective Insertion Task Inproceedings

Friedrich, Annemarie; Zeyrek, Deniz; Hoek, Jet (Ed.): Linguistic Annotation Workshop at ACL. LAW XIII 2019, pp. 16-25, Stroudsburg, PA, 2019, ISBN 978-1-950737-38-3.

The perspective of being able to crowd-source coherence relations bears the promise of acquiring annotations for new texts quickly, which could then increase the size and variety of discourse-annotated corpora. It would also open the avenue to answering new research questions: Collecting annotations from a larger number of individuals per instance would allow to investigate the distribution of inferred relations, and to study individual differences in coherence relation interpretation. However, annotating coherence relations with untrained workers is not trivial. We here propose a novel two-step annotation procedure, which extends an earlier method by Scholman and Demberg (2017a). In our approach, coherence relation labels are inferred from connectives that workers insert into the text. We show that the proposed method leads to replicable coherence annotations, and analyse the agreement between the obtained relation labels and annotations from PDTB and RSTDT on the same texts.
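
The inference step at the heart of the procedure can be pictured as a lookup from the inserted connective to a relation label; a schematic sketch with an invented, much-reduced connective inventory (not the paper's full mapping):

# Hypothetical mapping from inserted connectives to coherence relations.
CONNECTIVE_TO_RELATION = {
    "because": "CAUSE",
    "therefore": "RESULT",
    "however": "CONTRAST",
    "for example": "INSTANTIATION",
}

def infer_relation(inserted_connective: str) -> str:
    # Step 2 of the procedure: the label is read off the connective
    # a crowd worker chose to insert in step 1.
    return CONNECTIVE_TO_RELATION.get(inserted_connective.lower(), "UNKNOWN")

print(infer_relation("However"))  # CONTRAST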

@inproceedings{Yung2019,
title = {Crowdsourcing Discourse Relation Annotations by a Two-Step Connective Insertion Task},
author = {Frances Pik Yu Yung and Merel Scholman and Vera Demberg},
editor = {Annemarie Friedrich and Deniz Zeyrek and Jet Hoek},
url = {https://aclanthology.org/W19-4003.pdf},
doi = {https://doi.org/10.22028/D291-30470},
year = {2019},
date = {2019-08-01},
isbn = {978-1-950737-38-3},
pages = {16-25},
booktitle = {Linguistic Annotation Workshop at ACL. LAW XIII 2019},
address = {Stroudsburg, PA},
abstract = {The perspective of being able to crowd-source coherence relations bears the promise of acquiring annotations for new texts quickly, which could then increase the size and variety of discourse-annotated corpora. It would also open the avenue to answering new research questions: Collecting annotations from a larger number of individuals per instance would allow to investigate the distribution of inferred relations, and to study individual differences in coherence relation interpretation. However, annotating coherence relations with untrained workers is not trivial. We here propose a novel two-step annotation procedure, which extends an earlier method by Scholman and Demberg (2017a). In our approach, coherence relation labels are inferred from connectives that workers insert into the text. We show that the proposed method leads to replicable coherence annotations, and analyse the agreement between the obtained relation labels and annotations from PDTB and RSTDT on the same texts.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Venhuizen, Noortje; Hendriks, Petra; Crocker, Matthew W.; Brouwer, Harm

A Framework for Distributional Formal Semantics Inproceedings

Iemhoff, Rosalie; Moortgat, Michael; de Queiroz, Ruy (Ed.): Logic, Language, Information, and Computation, Proceedings of the 26th International Workshop WoLLIC 2019, pp. 633-646, 2019.

Formal semantics and distributional semantics offer complementary strengths in capturing the meaning of natural language. As such, a considerable amount of research has sought to unify them, either by augmenting formal semantic systems with a distributional component, or by defining a formal system on top of distributed representations.

Arriving at such a unified framework has, however, proven extremely challenging. One reason for this is that formal and distributional semantics operate on a fundamentally different ‘representational currency’: formal semantics defines meaning in terms of models of the world, whereas distributional semantics defines meaning in terms of linguistic co-occurrence. Here, we pursue an alternative approach by deriving a vector space model that defines meaning in a distributed manner relative to formal models of the world.

We will show that the resulting Distributional Formal Semantics offers probabilistic distributed representations that are also inherently compositional, and that naturally capture quantification and entailment. We moreover show that, when used as part of a neural network model, these representations allow for capturing incremental meaning construction and probabilistic inferencing. This framework thus lays the groundwork for an integrated distributional and formal approach to meaning.
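
A toy illustration of this 'representational currency' (invented for exposition, not the authors' implementation): propositions are vectors of truth values across a handful of formal models of the world, which makes probability, conjunction and entailment directly computable:

import numpy as np

# Columns = possible models (worlds); rows = propositions.
# An entry is 1 if the proposition holds in that model.
#                     m1 m2 m3 m4
meanings = {
    "rain":  np.array([1, 1, 0, 0]),
    "wet":   np.array([1, 1, 1, 0]),
    "sunny": np.array([0, 0, 1, 1]),
}

def prob(v):
    # Probability of a proposition = share of models in which it holds.
    return v.mean()

def entails(p, q):
    # p entails q iff q holds in every model where p holds.
    return bool(np.all(meanings[q][meanings[p] == 1] == 1))

conj = meanings["rain"] * meanings["sunny"]  # conjunction = pointwise product

print(prob(meanings["rain"]))  # 0.5
print(entails("rain", "wet"))  # True: every raining world is a wet world
print(prob(conj))              # 0.0: no model with both rain and sunshine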

@inproceedings{Venhuizen2019b,
title = {A Framework for Distributional Formal Semantics},
author = {Noortje Venhuizen and Petra Hendriks and Matthew W. Crocker and Harm Brouwer},
editor = {Rosalie Iemhoff and Michael Moortgat and Ruy de Queiroz},
url = {https://link.springer.com/chapter/10.1007%2F978-3-662-59533-6_39},
doi = {https://doi.org/10.1007/978-3-662-59533-6_39},
year = {2019},
date = {2019-06-09},
booktitle = {Logic, Language, Information, and Computation, Proceedings of the 26th International Workshop WoLLIC 2019},
pages = {633-646},
abstract = {Formal semantics and distributional semantics offer complementary strengths in capturing the meaning of natural language. As such, a considerable amount of research has sought to unify them, either by augmenting formal semantic systems with a distributional component, or by defining a formal system on top of distributed representations. Arriving at such a unified framework has, however, proven extremely challenging. One reason for this is that formal and distributional semantics operate on a fundamentally different ‘representational currency’: formal semantics defines meaning in terms of models of the world, whereas distributional semantics defines meaning in terms of linguistic co-occurrence. Here, we pursue an alternative approach by deriving a vector space model that defines meaning in a distributed manner relative to formal models of the world. We will show that the resulting Distributional Formal Semantics offers probabilistic distributed representations that are also inherently compositional, and that naturally capture quantification and entailment. We moreover show that, when used as part of a neural network model, these representations allow for capturing incremental meaning construction and probabilistic inferencing. This framework thus lays the groundwork for an integrated distributional and formal approach to meaning.},
pubstate = {published},
type = {inproceedings}
}

Project:   A1

Ortmann, Katrin; Dipper, Stefanie

Variation between Different Discourse Types: Literate vs. Oral Inproceedings

Proceedings of the NAACL-Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Association for Computational Linguistics, pp. 64-79, Ann Arbor, Michigan, 2019.

This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers.

@inproceedings{Ortmann2019,
title = {Variation between Different Discourse Types: Literate vs. Oral},
author = {Katrin Ortmann and Stefanie Dipper},
url = {https://aclanthology.org/W19-1407/},
doi = {https://doi.org/10.18653/v1/W19-1407},
year = {2019},
date = {2019-06-07},
booktitle = {Proceedings of the NAACL-Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)},
pages = {64-79},
publisher = {Association for Computational Linguistics},
address = {Ann Arbor, Michigan},
abstract = {This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Shi, Wei; Yung, Frances Pik Yu; Demberg, Vera

Acquiring Annotated Data with Cross-lingual Explicitation for Implicit Discourse Relation Classification Inproceedings

Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019, Association for Computational Linguistics, pp. 12-21, Minneapolis, USA, 2019.

Implicit discourse relation classification is one of the most challenging and important tasks in discourse parsing, due to the lack of connectives as strong linguistic cues. A principal bottleneck to further improvement is the shortage of training data (ca. 18k instances in the Penn Discourse Treebank (PDTB)). Shi et al. (2017) proposed to acquire additional data by exploiting connectives in translation: human translators mark discourse relations which are implicit in the source language explicitly in the translation. Using back-translations of such explicitated connectives improves discourse relation parsing performance. This paper addresses the open question of whether the choice of the translation language matters, and whether multiple translations into different languages can be effectively used to improve the quality of the additional data.

@inproceedings{Shi2019,
title = {Acquiring Annotated Data with Cross-lingual Explicitation for Implicit Discourse Relation Classification},
author = {Wei Shi and Frances Pik Yu Yung and Vera Demberg},
url = {https://aclanthology.org/W19-2703},
doi = {https://doi.org/10.18653/v1/W19-2703},
year = {2019},
date = {2019-06-06},
booktitle = {Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019},
pages = {12-21},
publisher = {Association for Computational Linguistics},
address = {Minneapolis, USA},
abstract = {Implicit discourse relation classification is one of the most challenging and important tasks in discourse parsing, due to the lack of connectives as strong linguistic cues. A principal bottleneck to further improvement is the shortage of training data (ca. 18k instances in the Penn Discourse Treebank (PDTB)). Shi et al. (2017) proposed to acquire additional data by exploiting connectives in translation: human translators mark discourse relations which are implicit in the source language explicitly in the translation. Using back-translations of such explicitated connectives improves discourse relation parsing performance. This paper addresses the open question of whether the choice of the translation language matters, and whether multiple translations into different languages can be effectively used to improve the quality of the additional data.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Demberg, Vera; Scholman, Merel; Torabi Asr, Fatemeh

How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations Journal Article

Dialogue & Discourse, 10 (1), pp. 87-135, 2019.

Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes comparison of the annotations difficult, thereby also preventing researchers from searching the corpora in a unified way, or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for mapping the relational labels of different frameworks to each other, but these proposals have so far not been validated against existing annotations. The two largest discourse relation annotated resources, the Penn Discourse Treebank and the Rhetorical Structure Theory Discourse Treebank, have however been annotated on the same text, allowing for a direct comparison of the annotation layers. We propose a method for automatically aligning the discourse segments, and then evaluate existing mapping proposals by comparing the empirically observed against the proposed mappings. Our analysis highlights the influence of segmentation on subsequent discourse relation labeling, and shows that while agreement between frameworks is reasonable for explicit relations, agreement on implicit relations is low. We identify several sources of systematic discrepancies between the two annotation schemes and discuss consequences of these discrepancies for future annotation and for the training of automatic discourse relation labellers.
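
As a rough sketch of what aligning segments across two annotation layers can involve, the following pairs each segment with its maximally overlapping counterpart by character span; this illustrates the general idea only, not the paper's actual algorithm:

def overlap(a, b):
    # Length of the character-span intersection of two (start, end) segments.
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def align(segments_a, segments_b, min_overlap=1):
    # Pair each segment from layer A with the layer-B segment
    # it overlaps most, if the overlap is large enough.
    pairs = []
    for a in segments_a:
        best = max(segments_b, key=lambda b: overlap(a, b))
        if overlap(a, best) >= min_overlap:
            pairs.append((a, best))
    return pairs

# Hypothetical PDTB-style vs. RST-DT-style argument spans (char offsets).
pdtb = [(0, 40), (41, 90)]
rst = [(0, 38), (39, 92)]
print(align(pdtb, rst))  # [((0, 40), (0, 38)), ((41, 90), (39, 92))]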

@article{Demberg2019,
title = {How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations},
author = {Vera Demberg and Merel Scholman and Fatemeh Torabi Asr},
url = {http://arxiv.org/abs/1704.08893},
year = {2019},
date = {2019-06-01},
journal = {Dialogue & Discourse},
pages = {87-135},
volume = {10},
number = {1},
abstract = {Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes comparison of the annotations difficult, thereby also preventing researchers from searching the corpora in a unified way, or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for mapping the relational labels of different frameworks to each other, but these proposals have so far not been validated against existing annotations. The two largest discourse relation annotated resources, the Penn Discourse Treebank and the Rhetorical Structure Theory Discourse Treebank, have however been annotated on the same text, allowing for a direct comparison of the annotation layers. We propose a method for automatically aligning the discourse segments, and then evaluate existing mapping proposals by comparing the empirically observed against the proposed mappings. Our analysis highlights the influence of segmentation on subsequent discourse relation labeling, and shows that while agreement between frameworks is reasonable for explicit relations, agreement on implicit relations is low. We identify several sources of systematic discrepancies between the two annotation schemes and discuss consequences of these discrepancies for future annotation and for the training of automatic discourse relation labellers.},
pubstate = {published},
type = {article}
}

Project:   B2

Shi, Wei; Demberg, Vera

Learning to Explicitate Connectives with Seq2Seq Network for Implicit Discourse Relation Classification Inproceedings

Proceedings of the 13th International Conference on Computational Semantics, Association for Computational Linguistics, pp. 188-199, Gothenburg, Sweden, 2019.

Implicit discourse relation classification is one of the most difficult steps in discourse parsing. The difficulty stems from the fact that the coherence relation must be inferred based on the content of the discourse relational arguments. Therefore, an effective encoding of the relational arguments is of crucial importance. We here propose a new model for implicit discourse relation classification, which consists of a classifier, and a sequence-to-sequence model which is trained to generate a representation of the discourse relational arguments by trying to predict the relational arguments including a suitable implicit connective. Training is possible because such implicit connectives have been annotated as part of the PDTB corpus. Along with a memory network, our model could generate more refined representations for the task. And on the now standard 11-way classification, our method outperforms the previous state of the art systems on the PDTB benchmark on multiple settings including cross validation.

@inproceedings{Shi2019b,
title = {Learning to Explicitate Connectives with Seq2Seq Network for Implicit Discourse Relation Classification},
author = {Wei Shi and Vera Demberg},
url = {https://aclanthology.org/W19-0416},
doi = {https://doi.org/10.18653/v1/W19-0416},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 13th International Conference on Computational Semantics},
pages = {188-199},
publisher = {Association for Computational Linguistics},
address = {Gothenburg, Sweden},
abstract = {Implicit discourse relation classification is one of the most difficult steps in discourse parsing. The difficulty stems from the fact that the coherence relation must be inferred based on the content of the discourse relational arguments. Therefore, an effective encoding of the relational arguments is of crucial importance. We here propose a new model for implicit discourse relation classification, which consists of a classifier, and a sequence-to-sequence model which is trained to generate a representation of the discourse relational arguments by trying to predict the relational arguments including a suitable implicit connective. Training is possible because such implicit connectives have been annotated as part of the PDTB corpus. Along with a memory network, our model could generate more refined representations for the task. And on the now standard 11-way classification, our method outperforms the previous state of the art systems on the PDTB benchmark on multiple settings including cross validation.},
pubstate = {published},
type = {inproceedings}
}

Project:   B2

Karakanta, Alina; Menzel, Katrin; Przybyl, Heike; Teich, Elke

Detecting linguistic variation in translated vs. interpreted texts using relative entropy Inproceedings

Empirical Investigations in the Forms of Mediated Discourse at the European Parliament, Thematic Session at the 49th Poznan Linguistic Meeting (PLM2019), Poznan, 2019.

Our aim is to identify the features distinguishing simultaneously interpreted texts from translations (apart from being more oral) and the characteristics they have in common which set them apart from originals (translationese features). Empirical research on the features of interpreted language and cross-modal analyses in contrast to research on translated language alone has attracted wider interest only recently. Previous interpreting studies are typically based on relatively small datasets of naturally occurring or experimental data (e.g. Shlesinger/Ordan, 2012, Chmiel et al. forthcoming, Dragsted/Hansen 2009) for specific language pairs. We propose a corpus-based, exploratory approach to detect typical linguistic features of interpreting vs. translation based on a well-structured multilingual European Parliament translation and interpreting corpus. We use the Europarl-UdS corpus (Karakanta et al. 2018) containing originals and translations for English, German and Spanish, and selected material from existing interpreting/combined interpreting-translation corpora (EPIC: Sandrelli/Bendazzoli 2005; TIC: Kajzer-Wietrzny 2012; EPICG: Defrancq 2015), complemented with additional interpreting data (German). The data were transcribed or revised according to our transcription guidelines ensuring comparability across different datasets. All data were enriched with relevant metadata. We aim to contribute to a more nuanced understanding of the characteristics of translated and interpreted texts and a more adequate empirical theory of mediated discourse.

@inproceedings{Karakanta2019,
title = {Detecting linguistic variation in translated vs. interpreted texts using relative entropy},
author = {Alina Karakanta and Katrin Menzel and Heike Przybyl and Elke Teich},
url = {https://www.researchgate.net/publication/336990114_Detecting_linguistic_variation_in_translated_vs_interpreted_texts_using_relative_entropy},
year = {2019},
date = {2019},
booktitle = {Empirical Investigations in the Forms of Mediated Discourse at the European Parliament, Thematic Session at the 49th Poznan Linguistic Meeting (PLM2019), Poznan},
abstract = {Our aim is to identify the features distinguishing simultaneously interpreted texts from translations (apart from being more oral) and the characteristics they have in common which set them apart from originals (translationese features). Empirical research on the features of interpreted language and cross-modal analyses in contrast to research on translated language alone has attracted wider interest only recently. Previous interpreting studies are typically based on relatively small datasets of naturally occurring or experimental data (e.g. Shlesinger/Ordan, 2012, Chmiel et al. forthcoming, Dragsted/Hansen 2009) for specific language pairs. We propose a corpus-based, exploratory approach to detect typical linguistic features of interpreting vs. translation based on a well-structured multilingual European Parliament translation and interpreting corpus. We use the Europarl-UdS corpus (Karakanta et al. 2018) containing originals and translations for English, German and Spanish, and selected material from existing interpreting/combined interpreting-translation corpora (EPIC: Sandrelli/Bendazzoli 2005; TIC: Kajzer-Wietrzny 2012; EPICG: Defrancq 2015), complemented with additional interpreting data (German). The data were transcribed or revised according to our transcription guidelines ensuring comparability across different datasets. All data were enriched with relevant metadata. We aim to contribute to a more nuanced understanding of the characteristics of translated and interpreted texts and a more adequate empirical theory of mediated discourse.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Menzel, Katrin; Krielke, Marie-Pauline; Teich, Elke

Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings Inproceedings

Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics, pp. 175-185, Florence, Italy, 2019.

The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.

@inproceedings{Bizzoni2019b,
title = {Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Katrin Menzel and Marie-Pauline Krielke and Elke Teich},
url = {https://aclanthology.org/W19-4722},
doi = {https://doi.org/10.18653/v1/W19-4722},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change},
pages = {175-185},
publisher = {Association for Computational Linguistics},
address = {Florence, Italy},
abstract = {The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Whang, James

Effects of phonotactic predictability on sensitivity to phonetic detail Journal Article

Laboratory Phonology: Journal of the Association for Laboratory Phonology, 10 (1), pp. 1-28, 2019.

Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence predictability has similar effects during perception.

@article{Whang2019,
title = {Effects of phonotactic predictability on sensitivity to phonetic detail},
author = {James Whang},
url = {https://www.journal-labphon.org/articles/10.5334/labphon.125/},
doi = {https://doi.org/10.5334/labphon.125},
year = {2019},
date = {2019-04-23},
journal = {Laboratory Phonology: Journal of the Association for Laboratory Phonology},
pages = {1-28},
volume = {10},
number = {1},
abstract = {Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence predictability has similar effects during perception.},
pubstate = {published},
type = {article}
}

Project:   C1

Menzel, Katrin

Daltonian atoms, Steiner's curve and Voltaic sparks - the role of eponymous terms in a diachronic corpus of English scientific writing Inproceedings

41. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS), Bremen, Germany, 2019.

This poster has a focus on eponymous academic and scientific terms in the first 200 years of the Royal Society Corpus (RSC, ca. 9,800 English scientific journal articles from the Royal Society of London, 1665-1869, cf. Kermes et al. 2016). It is annotated at different linguistic levels and provides a number of query and visualization options. Various types of metadata are encoded for each text, e.g. text topics / academic disciplines. This dataset contains a variety of eponymous terms named after English, foreign and classical scholars and inventors. The poster presents the results of a corpus study on eponymous terms with common structural features such as multiword terms with similar part of speech patterns (e.g. adjective + noun constructions such as Newtonian telescope) and terms with shared morphological elements, e.g. those that contain possessive markers (e.g. Steiner’s curve) or identical derivational affixes (e.g. Bezoutic, Hippocratic). Queries have been developed to automatically retrieve these terms from the corpus and the results were manually filtered afterwards. There are, for instance, around 3,000 eponymous adjective + noun constructions derived from ca. 160 different names of scholars. Some are used as titles for institutions or academic events, positions and honours (e.g. Plumian Professor, Jacksonian prize) while most refer to scientific concepts and discoveries (e.g. Daltonian atoms, Voltaic sparks). The terms show specific distribution patterns within and across documents. It can be observed how such terms have developed when English became established as a language of science and scholarship and what role they played throughout the following centuries. The analysis of these terms also contributes to reconstructing cultural aspects and language contacts in various scientific fields and time periods. Additionally, the results can be used to complement English lexicographical resources for specialized languages (cf. also Menzel 2018) and they contribute to a growing understanding of diachronic and cross-linguistic aspects of term formation processes.

@inproceedings{Menzel2019,
title = {Daltonian atoms, Steiner's curve and Voltaic sparks - the role of eponymous terms in a diachronic corpus of English scientific writing},
author = {Katrin Menzel},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/poster/Menzel.pdf},
year = {2019},
date = {2019-03-06},
publisher = {41. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS)},
address = {Bremen, Germany},
abstract = {This poster has a focus on eponymous academic and scientific terms in the first 200 years of the Royal Society Corpus (RSC, ca. 9,800 English scientific journal articles from the Royal Society of London, 1665-1869, cf. Kermes et al. 2016). It is annotated at different linguistic levels and provides a number of query and visualization options. Various types of metadata are encoded for each text, e.g. text topics / academic disciplines. This dataset contains a variety of eponymous terms named after English, foreign and classical scholars and inventors. The poster presents the results of a corpus study on eponymous terms with common structural features such as multiword terms with similar part of speech patterns (e.g. adjective + noun constructions such as Newtonian telescope) and terms with shared morphological elements, e.g. those that contain possessive markers (e.g. Steiner’s curve) or identical derivational affixes (e.g. Bezoutic, Hippocratic). Queries have been developed to automatically retrieve these terms from the corpus and the results were manually filtered afterwards. There are, for instance, around 3,000 eponymous adjective + noun constructions derived from ca. 160 different names of scholars. Some are used as titles for institutions or academic events, positions and honours (e.g. Plumian Professor, Jacksonian prize) while most refer to scientific concepts and discoveries (e.g. Daltonian atoms, Voltaic sparks). The terms show specific distribution patterns within and across documents. It can be observed how such terms have developed when English became established as a language of science and scholarship and what role they played throughout the following centuries. The analysis of these terms also contributes to reconstructing cultural aspects and language contacts in various scientific fields and time periods. Additionally, the results can be used to complement English lexicographical resources for specialized languages (cf. also Menzel 2018) and they contribute to a growing understanding of diachronic and cross-linguistic aspects of term formation processes.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Fischer, Stefan; Teich, Elke

More complex or just more diverse? Capturing diachronic linguistic variation Inproceedings

41. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS), Bremen, Germany, 2019.

We present a diachronic comparison of general (register-mixed) and scientific English in the late modern period (1700–1900). For our analysis we use two corpora which are comparable in size and time-span: the Corpus of Late Modern English (CLMET; De Smet et al. 2015) and the Royal Society Corpus (RSC; Kermes et al. 2016). Previous studies of scientific English found a diachronic tendency from a verbal, involved to a more nominal, abstract style compared to other discourse types (cf. Halliday 1988; Biber & Gray 2011). The features reported include type-token ratio, lexical density, number of words per sentence and relative frequency of nominal vs. verbal categories—all potential indicators of linguistic complexity at a shallow level. We present results for these common measures on our data set as well as for selected information-theoretic measures, notably relative entropy (Kullback–Leibler divergence: KLD) and surprisal. For instance, using KLD, we observe a continuous divergence between general and scientific language based on word unigrams as well as part-of-speech trigrams. Lexical density increases over time for both scientific language and general language. In both corpora, sentence length decreases by roughly 25%, with scientific sentences being longer on average. On the other hand, mean sentence surprisal remains stable over time. The poster will give an overview of our results using the selected measures and discuss possible interpretations. Moreover, we will assess their utility for capturing linguistic diversification, showing that the information-theoretic measures are fairly fine-tuned, robust and link up well to explanations in terms of linguistic complexity and rational communication (cf. Hale 2016; Crocker, Demberg, & Teich 2016).
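
The central measure is easy to compute; a minimal sketch of KLD between smoothed word unigram distributions, with two toy 'corpora' standing in for CLMET and RSC:

import math
from collections import Counter

def unigram_dist(tokens, vocab, alpha=0.1):
    # Relative frequencies with additive smoothing so that
    # KLD stays finite for words unseen in one corpus.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kld(p, q):
    # D(p || q) = sum_w p(w) * log2(p(w) / q(w)), in bits.
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)

general = "the man saw the dog and the dog ran".split()
scientific = "the experiment shows the effect of the acid".split()
vocab = set(general) | set(scientific)

p = unigram_dist(scientific, vocab)
q = unigram_dist(general, vocab)
print(round(kld(p, q), 3))  # divergence of scientific from general, in bits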

@inproceedings{Fischer2019,
title = {More complex or just more diverse? Capturing diachronic linguistic variation},
author = {Stefan Fischer and Elke Teich},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/poster/Fischer_Teich.pdf},
year = {2019},
date = {2019-03-06},
publisher = {41. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS)},
address = {Bremen, Germany},
abstract = {We present a diachronic comparison of general (register-mixed) and scientific English in the late modern period (1700–1900). For our analysis we use two corpora which are comparable in size and time-span: the Corpus of Late Modern English (CLMET; De Smet et al. 2015) and the Royal Society Corpus (RSC; Kermes et al. 2016). Previous studies of scientific English found a diachronic tendency from a verbal, involved to a more nominal, abstract style compared to other discourse types (cf. Halliday 1988; Biber & Gray 2011). The features reported include type-token ratio, lexical density, number of words per sentence and relative frequency of nominal vs. verbal categories—all potential indicators of linguistic complexity at a shallow level. We present results for these common measures on our data set as well as for selected information-theoretic measures, notably relative entropy (Kullback–Leibler divergence: KLD) and surprisal. For instance, using KLD, we observe a continuous divergence between general and scientific language based on word unigrams as well as part-of-speech trigrams. Lexical density increases over time for both scientific language and general language. In both corpora, sentence length decreases by roughly 25%, with scientific sentences being longer on average. On the other hand, mean sentence surprisal remains stable over time. The poster will give an overview of our results using the selected measures and discuss possible interpretations. Moreover, we will assess their utility for capturing linguistic diversification, showing that the information-theoretic measures are fairly fine-tuned, robust and link up well to explanations in terms of linguistic complexity and rational communication (cf. Hale 2016; Crocker, Demberg, & Teich 2016).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Grosse, Kathrin; Trost, Thomas; Mosbach, Marius; Backes, Michael; Klakow, Dietrich

On the security relevance of weights in deep learning Journal Article

CoRR, 2019.

Recently, a weight-based attack on stochastic gradient descent inducing overfitting has been proposed. We show that the threat is broader: A task-independent permutation on the initial weights suffices to limit the achieved accuracy to for example 50% on the Fashion MNIST dataset from initially more than 90%. These findings are confirmed on MNIST and CIFAR. We formally confirm that the attack succeeds with high likelihood and does not depend on the data. Empirically, weight statistics and loss appear unsuspicious, making it hard to detect the attack if the user is not aware. Our paper is thus a call for action to acknowledge the importance of the initial weights in deep learning.
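
Mechanically, the attack only requires permuting the initial weights; the numpy sketch below shows this mechanics and why summary statistics remain unsuspicious. Note that the damaging permutations are constructed in the paper itself; the uniformly random one used here is purely illustrative and need not have the same effect:

import numpy as np

rng = np.random.default_rng(0)

# Freshly initialized weight matrix of a dense layer (shape is illustrative).
W = rng.normal(0.0, 0.05, size=(784, 128))

# Task-independent tampering: apply a fixed permutation to the flattened
# initial weights before training starts.
perm = rng.permutation(W.size)
W_attacked = W.flatten()[perm].reshape(W.shape)

# First-order weight statistics are unchanged, which is what makes
# the manipulation hard to detect from the weights alone.
assert np.isclose(W.mean(), W_attacked.mean())
assert np.isclose(W.std(), W_attacked.std())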

@article{Grosse2019,
title = {On the security relevance of weights in deep learning},
author = {Kathrin Grosse and Thomas Trost and Marius Mosbach and Michael Backes and Dietrich Klakow},
url = {https://arxiv.org/abs/1902.03020},
year = {2019},
date = {2019},
journal = {CoRR},
abstract = {Recently, a weight-based attack on stochastic gradient descent inducing overfitting has been proposed. We show that the threat is broader: A task-independent permutation on the initial weights suffices to limit the achieved accuracy to for example 50% on the Fashion MNIST dataset from initially more than 90%. These findings are confirmed on MNIST and CIFAR. We formally confirm that the attack succeeds with high likelihood and does not depend on the data. Empirically, weight statistics and loss appear unsuspicious, making it hard to detect the attack if the user is not aware. Our paper is thus a call for action to acknowledge the importance of the initial weights in deep learning.},
pubstate = {published},
type = {article}
}

Project:   B4

Engonopoulos, Nikos; Teichmann, Christoph; Koller, Alexander

Discovering user groups for natural language generation Inproceedings

Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, 2018.

We present a model which predicts how individual users of a dialog system understand and produce utterances based on user groups. In contrast to previous work, these user groups are not specified beforehand, but learned in training. We evaluate on two referring expression (RE) generation tasks; our experiments show that our model can identify user groups and learn how to most effectively talk to them, and can dynamically assign unseen users to the correct groups as they interact with the system.

@inproceedings{Engonopoulos2018discovering,
title = {Discovering user groups for natural language generation},
author = {Nikos Engonopoulos and Christoph Teichmann and Alexander Koller},
url = {https://arxiv.org/abs/1806.05947},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue},
abstract = {We present a model which predicts how individual users of a dialog system understand and produce utterances based on user groups. In contrast to previous work, these user groups are not specified beforehand, but learned in training. We evaluate on two referring expression (RE) generation tasks; our experiments show that our model can identify user groups and learn how to most effectively talk to them, and can dynamically assign unseen users to the correct groups as they interact with the system.},
pubstate = {published},
type = {inproceedings}
}

Project:   A7

Jágrová, Klára; Avgustinova, Tania; Stenger, Irina; Fischer, Andrea

Language models, surprisal and fantasy in Slavic intercomprehension Journal Article

Computer Speech & Language, 2018.

In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language – a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance.

Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models.

We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that our distinguishing of these two measurable dimensions is helpful in understanding certain unexpected effects in human behaviour.
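
In-context surprisal from a trigram model is simple to state; a minimal sketch with maximum-likelihood trigram estimates over a toy corpus (a placeholder for the authors' training data):

import math
from collections import Counter

corpus = "we saw the old man and the old man saw us".split()

# Trigram and bigram counts for a maximum-likelihood model.
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def surprisal(w1, w2, w3):
    # -log2 P(w3 | w1, w2): high values mark words that are
    # unpredictable in their sentential context.
    p = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    return -math.log2(p)

print(surprisal("the", "old", "man"))  # 0.0 bits: fully predictable here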

@article{Jágrová2018b,
title = {Language models, surprisal and fantasy in Slavic intercomprehension},
author = {Kl{\'a}ra J{\'a}grov{\'a} and Tania Avgustinova and Irina Stenger and Andrea Fischer},
url = {https://www.sciencedirect.com/science/article/pii/S0885230817300451},
year = {2018},
date = {2018},
journal = {Computer Speech & Language},
abstract = {In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language – a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance. Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models. We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that our distinguishing of these two measurable dimensions is helpful in understanding certain unexpected effects in human behaviour.},
pubstate = {published},
type = {article}
}

Project:   C4

Jágrová, Klára; Stenger, Irina; Avgustinova, Tania

Polski nadal nieskomplikowany? Interkomprehensionsexperimente mit Nominalphrasen Journal Article

Polnisch in Deutschland. Zeitschrift der Bundesvereinigung der Polnischlehrkräfte, 5/2017, pp. 20-37, 2018.

@article{Jágrová2018,
title = {Polski nadal nieskomplikowany? Interkomprehensionsexperimente mit Nominalphrasen},
author = {Kl{\'a}ra J{\'a}grov{\'a} and Irina Stenger and Tania Avgustinova},
year = {2018},
date = {2018},
journal = {Polnisch in Deutschland. Zeitschrift der Bundesvereinigung der Polnischlehrkr{\"a}fte},
pages = {20-37},
volume = {5/2017},
pubstate = {published},
type = {article}
}

Project:   C4

Tourtouri, Elli; Sikos, Les; Crocker, Matthew W.

Referential Entropy influences Overspecification: Evidence from Production Miscellaneous

31st Annual CUNY Sentence Processing Conference, UC Davis, Davis CA, USA, 2018.

Specificity in referential communication

  • Grice’s Maxim of Quantity [1]: Speakers should produce only information that is strictly necessary for identifying the target
  • However, it is possible to establish reference with either minimally-specified (MS; precise) or over-specified (OS; redundant) expressions
  • Moreover, speakers overspecify frequently and systematically [e.g., 2-6]

Q: Why do people overspecify?

@miscellaneous{Tourtourietal2018a,
title = {Referential Entropy influences Overspecification: Evidence from Production},
author = {Elli Tourtouri and Les Sikos and Matthew W. Crocker},
url = {https://www.researchgate.net/publication/323809271_Referential_entropy_influences_overspecification_Evidence_from_production},
year = {2018},
date = {2018},
booktitle = {31st Annual CUNY Sentence Processing Conference},
publisher = {UC Davis},
address = {Davis CA, USA},
abstract = {Specificity in referential communication

  • Grice’s Maxim of Quantity [1]: Speakers should produce only information that is strictly necessary for identifying the target
  • However, it is possible to establish reference with either minimally-specified (MS; precise) or over-specified (OS; redundant) expressions
  • Moreover, speakers overspecify frequently and systematically [e.g., 2-6]
Q: Why do people overspecify?},
pubstate = {published},
type = {miscellaneous}
}

Project:   C3
