Publications

Shi, Wei; Yung, Frances Pik Yu; Demberg, Vera

Acquiring Annotated Data with Cross-lingual Explicitation for Implicit Discourse Relation Classification Inproceedings

Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019, Association for Computational Linguistics, pp. 12-21, Minneapolis, USA, 2019.

Implicit discourse relation classification is one of the most challenging and important tasks in discourse parsing, due to the lack of connectives as strong linguistic cues. A principal bottleneck to further improvement is the shortage of training data (ca. 18k instances in the Penn Discourse Treebank (PDTB)). Shi et al. (2017) proposed to acquire additional data by exploiting connectives in translation: human translators mark discourse relations which are implicit in the source language explicitly in the translation. Using back-translations of such explicitated connectives improves discourse relation parsing performance. This paper addresses the open question of whether the choice of the translation language matters, and whether multiple translations into different languages can be effectively used to improve the quality of the additional data.
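
A minimal sketch of the explicitation-mining idea described above, assuming word-aligned sentence pairs and a small connective lexicon (both hypothetical here, as is every name in the snippet): target-side connectives that align to no source token were plausibly inserted by the translator, and their back-translations can label the source-side implicit relation. This illustrates the general approach, not the authors' actual pipeline.

# Toy illustration of mining explicitated connectives; data and lexicon are made up.
BACK_TRANSLATION = {"deshalb": "therefore", "aber": "but", "weil": "because"}

def explicitated_connectives(src_tokens, tgt_tokens, alignment):
    """Return target connectives with no aligned source token.

    alignment: set of (src_idx, tgt_idx) pairs from a word aligner.
    A connective without any aligned source token was plausibly added
    (explicitated) by the translator.
    """
    aligned_tgt = {j for (_, j) in alignment}
    return [(tok, BACK_TRANSLATION[tok.lower()])
            for j, tok in enumerate(tgt_tokens)
            if tok.lower() in BACK_TRANSLATION and j not in aligned_tgt]

src = "the road was icy . we stayed home .".split()
tgt = "die Strasse war vereist . deshalb blieben wir zu Hause .".split()
align = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 7), (6, 6), (7, 9), (8, 10)}
print(explicitated_connectives(src, tgt, align))  # [('deshalb', 'therefore')]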

@inproceedings{Shi2019,
title = {Acquiring Annotated Data with Cross-lingual Explicitation for Implicit Discourse Relation Classification},
author = {Wei Shi and Frances Pik Yu Yung and Vera Demberg},
url = {https://aclanthology.org/W19-2703},
doi = {https://doi.org/10.18653/v1/W19-2703},
year = {2019},
date = {2019-06-06},
booktitle = {Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019},
pages = {12-21},
publisher = {Association for Computational Linguistics},
address = {Minneapolis, USA},
abstract = {Implicit discourse relation classification is one of the most challenging and important tasks in discourse parsing, due to the lack of connectives as strong linguistic cues. A principal bottleneck to further improvement is the shortage of training data (ca. 18k instances in the Penn Discourse Treebank (PDTB)). Shi et al. (2017) proposed to acquire additional data by exploiting connectives in translation: human translators mark discourse relations which are implicit in the source language explicitly in the translation. Using back-translations of such explicitated connectives improves discourse relation parsing performance. This paper addresses the open question of whether the choice of the translation language matters, and whether multiple translations into different languages can be effectively used to improve the quality of the additional data.},
pubstate = {published},
type = {inproceedings}
}


Project:   B2

Demberg, Vera; Scholman, Merel; Torabi Asr, Fatemeh

How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations Journal Article

Dialogue & Discourse, 10 (1), pp. 87-135, 2019.

Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes comparison of the annotations difficult, thereby also preventing researchers from searching the corpora in a unified way, or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for mapping the relational labels of different frameworks to each other, but these proposals have so far not been validated against existing annotations. The two largest discourse relation annotated resources, the Penn Discourse Treebank and the Rhetorical Structure Theory Discourse Treebank, have however been annotated on the same text, allowing for a direct comparison of the annotation layers. We propose a method for automatically aligning the discourse segments, and then evaluate existing mapping proposals by comparing the empirically observed against the proposed mappings. Our analysis highlights the influence of segmentation on subsequent discourse relation labeling, and shows that while agreement between frameworks is reasonable for explicit relations, agreement on implicit relations is low. We identify several sources of systematic discrepancies between the two annotation schemes and discuss consequences of these discrepancies for future annotation and for the training of automatic discourse relation labellers.
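
As a concrete illustration of the evaluation step described above, the sketch below tabulates empirically observed label pairs over aligned segments and scores them against a proposed mapping. The labels, the mapping and the data are toy examples, not the paper's actual inventory or results.

from collections import Counter

# (PDTB label, RST label) for each pair of aligned segments -- hypothetical data.
aligned_pairs = [
    ("Contingency.Cause", "cause"),
    ("Contingency.Cause", "result"),
    ("Comparison.Contrast", "contrast"),
    ("Comparison.Contrast", "antithesis"),
    ("Expansion.Conjunction", "list"),
]

# A proposed PDTB-to-RST mapping to validate (toy version).
proposed = {
    "Contingency.Cause": {"cause", "result"},
    "Comparison.Contrast": {"contrast"},
    "Expansion.Conjunction": {"list", "joint"},
}

observed = Counter(aligned_pairs)
agree = sum(n for (pdtb, rst), n in observed.items() if rst in proposed.get(pdtb, set()))
total = sum(observed.values())
print(f"agreement with proposed mapping: {agree}/{total} = {agree / total:.0%}")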

@article{Demberg2019,
title = {How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations},
author = {Vera Demberg and Merel Scholman and Fatemeh Torabi Asr},
url = {http://arxiv.org/abs/1704.08893},
year = {2019},
date = {2019-06-01},
journal = {Dialogue & Discourse},
pages = {87-135},
volume = {10},
number = {1},
abstract = {Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes comparison of the annotations difficult, thereby also preventing researchers from searching the corpora in a unified way, or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for mapping the relational labels of different frameworks to each other, but these proposals have so far not been validated against existing annotations. The two largest discourse relation annotated resources, the Penn Discourse Treebank and the Rhetorical Structure Theory Discourse Treebank, have however been annotated on the same text, allowing for a direct comparison of the annotation layers. We propose a method for automatically aligning the discourse segments, and then evaluate existing mapping proposals by comparing the empirically observed against the proposed mappings. Our analysis highlights the influence of segmentation on subsequent discourse relation labeling, and shows that while agreement between frameworks is reasonable for explicit relations, agreement on implicit relations is low. We identify several sources of systematic discrepancies between the two annotation schemes and discuss consequences of these discrepancies for future annotation and for the training of automatic discourse relation labellers.},
pubstate = {published},
type = {article}
}


Project:   B2

Shi, Wei; Demberg, Vera

Learning to Explicitate Connectives with Seq2Seq Network for Implicit Discourse Relation Classification Inproceedings

Proceedings of the 13th International Conference on Computational Semantics, Association for Computational Linguistics, pp. 188-199, Gothenburg, Sweden, 2019.

Implicit discourse relation classification is one of the most difficult steps in discourse parsing. The difficulty stems from the fact that the coherence relation must be inferred based on the content of the discourse relational arguments. Therefore, an effective encoding of the relational arguments is of crucial importance. We here propose a new model for implicit discourse relation classification, which consists of a classifier and a sequence-to-sequence model which is trained to generate a representation of the discourse relational arguments by trying to predict the relational arguments including a suitable implicit connective. Training is possible because such implicit connectives have been annotated as part of the PDTB corpus. Along with a memory network, our model generates more refined representations for the task. On the now-standard 11-way classification, our method outperforms previous state-of-the-art systems on the PDTB benchmark in multiple settings, including cross-validation.
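
A heavily simplified PyTorch sketch of the joint objective described above: encode the relational arguments, classify the relation from the encoder state, and decode the arguments together with a suitable connective. The dimensions, vocabulary handling, and the omission of attention and the memory network are all simplifications; the paper's actual architecture is more elaborate.

# Simplified sketch of a joint classification + explicitation objective.
import torch
import torch.nn as nn

class ExplicitationModel(nn.Module):
    def __init__(self, vocab=10000, emb=100, hid=256, n_rel=11):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.classifier = nn.Linear(hid, n_rel)   # 11-way relation sense
        self.decoder = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)          # regenerate args + connective

    def forward(self, src, tgt_in):
        enc_out, (h, c) = self.encoder(self.embed(src))
        rel_logits = self.classifier(h[-1])       # classify from final state
        dec_out, _ = self.decoder(self.embed(tgt_in), (h, c))
        return rel_logits, self.out(dec_out)

model = ExplicitationModel()
src = torch.randint(0, 10000, (8, 30))     # argument pair, batch of 8
tgt_in = torch.randint(0, 10000, (8, 31))  # args + inserted connective, shifted
tgt_out = torch.randint(0, 10000, (8, 31))
rel = torch.randint(0, 11, (8,))
rel_logits, gen_logits = model(src, tgt_in)
ce = nn.CrossEntropyLoss()
# Joint loss: relation classification plus regeneration of the explicitated text.
loss = ce(rel_logits, rel) + ce(gen_logits.reshape(-1, 10000), tgt_out.reshape(-1))
loss.backward()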

@inproceedings{Shi2019b,
title = {Learning to Explicitate Connectives with Seq2Seq Network for Implicit Discourse Relation Classification},
author = {Wei Shi and Vera Demberg},
url = {https://aclanthology.org/W19-0416},
doi = {https://doi.org/10.18653/v1/W19-0416},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 13th International Conference on Computational Semantics},
pages = {188-199},
publisher = {Association for Computational Linguistics},
address = {Gothenburg, Sweden},
abstract = {Implicit discourse relation classification is one of the most difficult steps in discourse parsing. The difficulty stems from the fact that the coherence relation must be inferred based on the content of the discourse relational arguments. Therefore, an effective encoding of the relational arguments is of crucial importance. We here propose a new model for implicit discourse relation classification, which consists of a classifier and a sequence-to-sequence model which is trained to generate a representation of the discourse relational arguments by trying to predict the relational arguments including a suitable implicit connective. Training is possible because such implicit connectives have been annotated as part of the PDTB corpus. Along with a memory network, our model generates more refined representations for the task. On the now-standard 11-way classification, our method outperforms previous state-of-the-art systems on the PDTB benchmark in multiple settings, including cross-validation.},
pubstate = {published},
type = {inproceedings}
}


Project:   B2

Karakanta, Alina; Menzel, Katrin; Przybyl, Heike; Teich, Elke

Detecting linguistic variation in translated vs. interpreted texts using relative entropy Inproceedings

Empirical Investigations in the Forms of Mediated Discourse at the European Parliament, Thematic Session at the 49th Poznan Linguistic Meeting (PLM2019), Poznan, 2019.

Our aim is to identify the features distinguishing simultaneously interpreted texts from translations (apart from being more oral) and the characteristics they have in common which set them apart from originals (translationese features). Empirical research on the features of interpreted language and cross-modal analyses in contrast to research on translated language alone has attracted wider interest only recently. Previous interpreting studies are typically based on relatively small datasets of naturally occurring or experimental data (e.g. Shlesinger/Ordan, 2012, Chmiel et al. forthcoming, Dragsted/Hansen 2009) for specific language pairs. We propose a corpus-based, exploratory approach to detect typical linguistic features of interpreting vs. translation based on a well-structured multilingual European Parliament translation and interpreting corpus. We use the Europarl-UdS corpus (Karakanta et al. 2018) containing originals and translations for English, German and Spanish, and selected material from existing interpreting/combined interpreting-translation corpora (EPIC: Sandrelli/Bendazzoli 2005; TIC: Kajzer-Wietrzny 2012; EPICG: Defrancq 2015), complemented with additional interpreting data (German). The data were transcribed or revised according to our transcription guidelines ensuring comparability across different datasets. All data were enriched with relevant metadata. We aim to contribute to a more nuanced understanding of the characteristics of translated and interpreted texts and a more adequate empirical theory of mediated discourse.
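
As a rough sketch of the core measure, relative entropy between the word distributions of two varieties can be computed as below, with add-alpha smoothing over a shared vocabulary. The two token lists are toy stand-ins for interpreted and translated corpus data, and the smoothing choice is an assumption.

import math
from collections import Counter

def kld(tokens_p, tokens_q, alpha=0.5):
    """D(P || Q) in bits: the average number of extra bits needed to encode
    variety P with a model estimated from variety Q."""
    vocab = set(tokens_p) | set(tokens_q)
    cp, cq = Counter(tokens_p), Counter(tokens_q)
    np_ = len(tokens_p) + alpha * len(vocab)
    nq = len(tokens_q) + alpha * len(vocab)
    return sum(
        ((cp[w] + alpha) / np_)
        * math.log2(((cp[w] + alpha) / np_) / ((cq[w] + alpha) / nq))
        for w in vocab
    )

interpreted = "so we think that this is er really quite important".split()
translated = "we consider this matter to be of considerable importance".split()
print(f"D(interpreted || translated) = {kld(interpreted, translated):.3f} bits")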

@inproceedings{Karakanta2019,
title = {Detecting linguistic variation in translated vs. interpreted texts using relative entropy},
author = {Alina Karakanta and Katrin Menzel and Heike Przybyl and Elke Teich},
url = {https://www.researchgate.net/publication/336990114_Detecting_linguistic_variation_in_translated_vs_interpreted_texts_using_relative_entropy},
year = {2019},
date = {2019},
booktitle = {Empirical Investigations in the Forms of Mediated Discourse at the European Parliament, Thematic Session at the 49th Poznan Linguistic Meeting (PLM2019), Poznan},
abstract = {Our aim is to identify the features distinguishing simultaneously interpreted texts from translations (apart from being more oral) and the characteristics they have in common which set them apart from originals (translationese features). Empirical research on the features of interpreted language and cross-modal analyses in contrast to research on translated language alone has attracted wider interest only recently. Previous interpreting studies are typically based on relatively small datasets of naturally occurring or experimental data (e.g. Shlesinger/Ordan, 2012, Chmiel et al. forthcoming, Dragsted/Hansen 2009) for specific language pairs. We propose a corpus-based, exploratory approach to detect typical linguistic features of interpreting vs. translation based on a well-structured multilingual European Parliament translation and interpreting corpus. We use the Europarl-UdS corpus (Karakanta et al. 2018) containing originals and translations for English, German and Spanish, and selected material from existing interpreting/combined interpreting-translation corpora (EPIC: Sandrelli/Bendazzoli 2005; TIC: Kajzer-Wietrzny 2012; EPICG: Defrancq 2015), complemented with additional interpreting data (German). The data were transcribed or revised according to our transcription guidelines ensuring comparability across different datasets. All data were enriched with relevant metadata. We aim to contribute to a more nuanced understanding of the characteristics of translated and interpreted texts and a more adequate empirical theory of mediated discourse.},
pubstate = {published},
type = {inproceedings}
}


Project:   B7

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Menzel, Katrin; Krielke, Marie-Pauline; Teich, Elke

Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings Inproceedings

Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics, pp. 175-185, Florence, Italy, 2019.

The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.
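
One way to make diachronic embeddings comparable across periods, sketched below under the assumption that embeddings are trained separately per period, is an orthogonal Procrustes alignment followed by cross-period cosine similarity for the tracked function words. The random matrices are placeholders for real diachronic embeddings, and this need not match the authors' exact procedure.

import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
vocab = ["of", "which", "the", "hence", "therefore"]
emb_1700s = rng.normal(size=(len(vocab), 50))  # period-1 embeddings (toy)
emb_1800s = rng.normal(size=(len(vocab), 50))  # period-2 embeddings (toy)

# Rotate the period-2 space onto the period-1 space: embeddings from separate
# training runs are only defined up to rotation.
R, _ = orthogonal_procrustes(emb_1800s, emb_1700s)
aligned = emb_1800s @ R

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for i, w in enumerate(vocab):
    # Low cross-period similarity suggests a shift in the word's usage.
    print(f"{w:10s} cross-period cosine = {cosine(emb_1700s[i], aligned[i]):.2f}")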

@inproceedings{Bizzoni2019,
title = {Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Katrin Menzel and Marie-Pauline Krielke and Elke Teich},
url = {https://aclanthology.org/W19-4722},
doi = {https://doi.org/10.18653/v1/W19-4722},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change},
pages = {175-185},
publisher = {Association for Computational Linguistics},
address = {Florence, Italy},
abstract = {The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Whang, James

Effects of phonotactic predictability on sensitivity to phonetic detail Journal Article

Laboratory Phonology: Journal of the Association for Laboratory Phonology, 10 (1), pp. 1-28, 2019.

Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence that predictability has similar effects during perception.

@article{Whang2019,
title = {Effects of phonotactic predictability on sensitivity to phonetic detail},
author = {James Whang},
url = {https://www.journal-labphon.org/articles/10.5334/labphon.125/},
doi = {https://doi.org/10.5334/labphon.125},
year = {2019},
date = {2019-04-23},
journal = {Laboratory Phonology: Journal of the Association for Laboratory Phonology},
pages = {1-28},
volume = {10},
number = {1},
abstract = {Japanese speakers systematically devoice or delete high vowels [i, u] between two voiceless consonants. Japanese listeners also report perceiving the same high vowels between consonant clusters even in the absence of a vocalic segment. Although perceptual vowel epenthesis has been described primarily as a phonotactic repair strategy, where a phonetically minimal vowel is epenthesized by default, few studies have investigated how the predictability of a vowel in a given context affects the choice of epenthetic vowel. The present study uses a forced-choice labeling task to test how sensitive Japanese listeners are to coarticulatory cues of high vowels [i, u] and non-high vowel [a] in devoicing and non-devoicing contexts. Devoicing contexts were further divided into high-predictability contexts, where the phonotactic distribution strongly favors one of the high vowels, and low-predictability contexts, where both high vowels are allowed, to specifically test for the effects of predictability. Results reveal a strong tendency towards [u] epenthesis as previous studies have found, but the results also reveal a sensitivity to coarticulatory cues that override the default [u] epenthesis, particularly in low-predictability contexts. Previous studies have shown that predictability affects phonetic implementation during production, and this study provides evidence that predictability has similar effects during perception.},
pubstate = {published},
type = {article}
}


Project:   C1

Menzel, Katrin

Daltonian atoms, Steiner's curve and Voltaic sparks - the role of eponymous terms in a diachronic corpus of English scientific writing Inproceedings

41. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS), Bremen, Germany, 2019.

This poster focuses on eponymous academic and scientific terms in the first 200 years of the Royal Society Corpus (RSC, ca. 9,800 English scientific journal articles from the Royal Society of London, 1665-1869, cf. Kermes et al. 2016). It is annotated at different linguistic levels and provides a number of query and visualization options. Various types of metadata are encoded for each text, e.g. text topics / academic disciplines. This dataset contains a variety of eponymous terms named after English, foreign and classical scholars and inventors. The poster presents the results of a corpus study on eponymous terms with common structural features such as multiword terms with similar part of speech patterns (e.g. adjective + noun constructions such as Newtonian telescope) and terms with shared morphological elements, e.g. those that contain possessive markers (e.g. Steiner’s curve) or identical derivational affixes (e.g. Bezoutic, Hippocratic). Queries have been developed to automatically retrieve these terms from the corpus and the results were manually filtered afterwards. There are, for instance, around 3,000 eponymous adjective + noun constructions derived from ca. 160 different names of scholars. Some are used as titles for institutions or academic events, positions and honours (e.g. Plumian Professor, Jacksonian prize) while most refer to scientific concepts and discoveries (e.g. Daltonian atoms, Voltaic sparks). The terms show specific distribution patterns within and across documents. It can be observed how such terms have developed when English became established as a language of science and scholarship and what role they played throughout the following centuries. The analysis of these terms also contributes to reconstructing cultural aspects and language contacts in various scientific fields and time periods. Additionally, the results can be used to complement English lexicographical resources for specialized languages (cf. also Menzel 2018) and they contribute to a growing understanding of diachronic and cross-linguistic aspects of term formation processes.
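
A toy sketch of the kind of pattern query described above, here as Python regular expressions over POS-tagged text: capitalised adjectives with eponym-forming suffixes followed by a noun, and proper name + possessive marker + noun. The tagged-token format and suffix list are illustrative assumptions; the actual corpus queries (e.g. in a CQP-style engine) are richer and were filtered manually.

import re

tagged = ("the/DT Newtonian/JJ telescope/NN was/VBD mounted/VBN near/IN "
          "a/DT Voltaic/JJ pile/NN and/CC Steiner/NNP 's/POS curve/NN")

# Capitalised adjective ending in a typical eponym-forming suffix, then a noun.
adj_noun = re.compile(r"([A-Z][a-z]+(?:ian|ic|esque))/JJ (\w+)/NNS?\b")
# Proper name + possessive marker + noun (e.g. "Steiner's curve").
possessive = re.compile(r"([A-Z][a-z]+)/NNP 's/POS (\w+)/NNS?\b")

print(adj_noun.findall(tagged))    # [('Newtonian', 'telescope'), ('Voltaic', 'pile')]
print(possessive.findall(tagged))  # [('Steiner', 'curve')]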

@inproceedings{Menzel2019,
title = {Daltonian atoms, Steiner's curve and Voltaic sparks - the role of eponymous terms in a diachronic corpus of English scientific writing},
author = {Katrin Menzel},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/poster/Menzel.pdf},
year = {2019},
date = {2019-03-06},
publisher = {41. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS)},
address = {Bremen, Germany},
abstract = {This poster focuses on eponymous academic and scientific terms in the first 200 years of the Royal Society Corpus (RSC, ca. 9,800 English scientific journal articles from the Royal Society of London, 1665-1869, cf. Kermes et al. 2016). It is annotated at different linguistic levels and provides a number of query and visualization options. Various types of metadata are encoded for each text, e.g. text topics / academic disciplines. This dataset contains a variety of eponymous terms named after English, foreign and classical scholars and inventors. The poster presents the results of a corpus study on eponymous terms with common structural features such as multiword terms with similar part of speech patterns (e.g. adjective + noun constructions such as Newtonian telescope) and terms with shared morphological elements, e.g. those that contain possessive markers (e.g. Steiner’s curve) or identical derivational affixes (e.g. Bezoutic, Hippocratic). Queries have been developed to automatically retrieve these terms from the corpus and the results were manually filtered afterwards. There are, for instance, around 3,000 eponymous adjective + noun constructions derived from ca. 160 different names of scholars. Some are used as titles for institutions or academic events, positions and honours (e.g. Plumian Professor, Jacksonian prize) while most refer to scientific concepts and discoveries (e.g. Daltonian atoms, Voltaic sparks). The terms show specific distribution patterns within and across documents. It can be observed how such terms have developed when English became established as a language of science and scholarship and what role they played throughout the following centuries. The analysis of these terms also contributes to reconstructing cultural aspects and language contacts in various scientific fields and time periods. Additionally, the results can be used to complement English lexicographical resources for specialized languages (cf. also Menzel 2018) and they contribute to a growing understanding of diachronic and cross-linguistic aspects of term formation processes.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Fischer, Stefan; Teich, Elke

More complex or just more diverse? Capturing diachronic linguistic variation Inproceedings

41. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS), Bremen, Germany, 2019.

We present a diachronic comparison of general (register-mixed) and scientific English in the late modern period (1700–1900). For our analysis we use two corpora which are comparable in size and time-span: the Corpus of Late Modern English (CLMET; De Smet et al. 2015) and the Royal Society Corpus (RSC; Kermes et al. 2016). Previous studies of scientific English found a diachronic tendency from a verbal, involved to a more nominal, abstract style compared to other discourse types (cf. Halliday 1988; Biber & Gray 2011). The features reported include type-token ratio, lexical density, number of words per sentence and relative frequency of nominal vs. verbal categories—all potential indicators of linguistic complexity at a shallow level. We present results for these common measures on our data set as well as for selected information-theoretic measures, notably relative entropy (Kullback–Leibler divergence: KLD) and surprisal. For instance, using KLD, we observe a continuous divergence between general and scientific language based on word unigrams as well as part-of-speech trigrams. Lexical density increases over time for both scientific language and general language. In both corpora, sentence length decreases by roughly 25%, with scientific sentences being longer on average. On the other hand, mean sentence surprisal remains stable over time. The poster will give an overview of our results using the selected measures and discuss possible interpretations. Moreover, we will assess their utility for capturing linguistic diversification, showing that the information-theoretic measures are fairly fine-tuned, robust and link up well to explanations in terms of linguistic complexity and rational communication (cf. Hale 2016; Crocker, Demberg, & Teich 2016).
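
For concreteness, here is a minimal sketch of the shallow measures mentioned above (type-token ratio, lexical density, mean sentence length) over POS-tagged sentences; the content-word test via Penn-style tag initials is a simplifying assumption, and the toy input merely illustrates the interface.

def shallow_measures(sentences):
    """sentences: list of lists of (word, pos) pairs."""
    tokens = [(w.lower(), p) for sent in sentences for (w, p) in sent]
    types = {w for w, _ in tokens}
    # Content words: nouns, verbs, adjectives, adverbs (Penn-style tag initials).
    content = [w for w, p in tokens if p[0] in "NVJR"]
    return {
        "type_token_ratio": len(types) / len(tokens),
        "lexical_density": len(content) / len(tokens),
        "mean_sentence_length": sum(len(s) for s in sentences) / len(sentences),
    }

toy = [
    [("The", "DT"), ("acid", "NN"), ("dissolves", "VBZ"), ("metal", "NN")],
    [("The", "DT"), ("metal", "NN"), ("reacts", "VBZ"), ("rapidly", "RB")],
]
print(shallow_measures(toy))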

@inproceedings{Fischer2019,
title = {More complex or just more diverse? Capturing diachronic linguistic variation},
author = {Stefan Fischer and Elke Teich},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/poster/Fischer_Teich.pdf},
year = {2019},
date = {2019-03-06},
publisher = {41. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS)},
address = {Bremen, Germany},
abstract = {We present a diachronic comparison of general (register-mixed) and scientific English in the late modern period (1700–1900). For our analysis we use two corpora which are comparable in size and time-span: the Corpus of Late Modern English (CLMET; De Smet et al. 2015) and the Royal Society Corpus (RSC; Kermes et al. 2016). Previous studies of scientific English found a diachronic tendency from a verbal, involved to a more nominal, abstract style compared to other discourse types (cf. Halliday 1988; Biber & Gray 2011). The features reported include type-token ratio, lexical density, number of words per sentence and relative frequency of nominal vs. verbal categories—all potential indicators of linguistic complexity at a shallow level. We present results for these common measures on our data set as well as for selected information-theoretic measures, notably relative entropy (Kullback–Leibler divergence: KLD) and surprisal. For instance, using KLD, we observe a continuous divergence between general and scientific language based on word unigrams as well as part-of-speech trigrams. Lexical density increases over time for both scientific language and general language. In both corpora, sentence length decreases by roughly 25%, with scientific sentences being longer on average. On the other hand, mean sentence surprisal remains stable over time. The poster will give an overview of our results using the selected measures and discuss possible interpretations. Moreover, we will assess their utility for capturing linguistic diversification, showing that the information-theoretic measures are fairly fine-tuned, robust and link up well to explanations in terms of linguistic complexity and rational communication (cf. Hale 2016; Crocker, Demberg, & Teich 2016).},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Grosse, Kathrin; Trost, Thomas; Mosbach, Marius; Backes, Michael; Klakow, Dietrich

On the security relevance of weights in deep learning Journal Article

CoRR, 2019.

Recently, a weight-based attack on stochastic gradient descent inducing overfitting has been proposed. We show that the threat is broader: A task-independent permutation on the initial weights suffices to limit the achieved accuracy to, for example, 50% on the Fashion MNIST dataset, down from initially more than 90%. These findings are confirmed on MNIST and CIFAR. We formally confirm that the attack succeeds with high likelihood and does not depend on the data. Empirically, weight statistics and loss appear unsuspicious, making it hard to detect the attack if the user is not aware. Our paper is thus a call for action to acknowledge the importance of the initial weights in deep learning.
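
A pure-NumPy sketch of the attack setting, assuming a Glorot-style initialisation (an illustrative choice): a fixed, data-independent permutation is applied to the initial weight matrices, leaving their marginal statistics unchanged, which is what makes the attack hard to spot. The paper's experiments of course involve real training runs.

import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_in, n_out):
    # Glorot-style initialisation (illustrative choice).
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

weights = [init_layer(784, 256), init_layer(256, 10)]

# Task-independent attack: one fixed permutation of each matrix's entries,
# chosen without ever looking at the data.
def permute_weights(W, seed=0):
    perm_rng = np.random.default_rng(seed)
    flat = W.ravel().copy()
    perm_rng.shuffle(flat)
    return flat.reshape(W.shape)

attacked = [permute_weights(W) for W in weights]
# Marginal statistics are unchanged, so the manipulation looks unsuspicious:
print(np.mean(weights[0]), np.std(weights[0]))
print(np.mean(attacked[0]), np.std(attacked[0]))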

@article{Grosse2019,
title = {On the security relevance of weights in deep learning},
author = {Kathrin Grosse and Thomas Trost and Marius Mosbach and Michael Backes and Dietrich Klakow},
url = {https://arxiv.org/abs/1902.03020},
year = {2019},
date = {2019},
journal = {CoRR},
abstract = {Recently, a weight-based attack on stochastic gradient descent inducing overfitting has been proposed. We show that the threat is broader: A task-independent permutation on the initial weights suffices to limit the achieved accuracy to, for example, 50% on the Fashion MNIST dataset, down from initially more than 90%. These findings are confirmed on MNIST and CIFAR. We formally confirm that the attack succeeds with high likelihood and does not depend on the data. Empirically, weight statistics and loss appear unsuspicious, making it hard to detect the attack if the user is not aware. Our paper is thus a call for action to acknowledge the importance of the initial weights in deep learning.},
pubstate = {published},
type = {article}
}


Project:   B4

Engonopoulos, Nikos; Teichmann, Christoph; Koller, Alexander

Discovering user groups for natural language generation Inproceedings

Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, 2018.

We present a model which predicts how individual users of a dialog system understand and produce utterances based on user groups. In contrast to previous work, these user groups are not specified beforehand, but learned in training. We evaluate on two referring expression (RE) generation tasks; our experiments show that our model can identify user groups and learn how to most effectively talk to them, and can dynamically assign unseen users to the correct groups as they interact with the system.
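
The dynamic assignment of unseen users can be pictured as Bayesian updating over learned groups, as in the generic sketch below; the priors and likelihoods are invented for illustration and do not reflect the paper's trained model, which learns the groups themselves during training.

import numpy as np

prior = np.array([0.5, 0.3, 0.2])   # learned group priors (toy numbers)
# P(observed behaviour o | group g) for a small behaviour inventory (toy numbers).
likelihood = np.array([
    [0.7, 0.2, 0.1],                # group 0
    [0.2, 0.6, 0.2],                # group 1
    [0.1, 0.2, 0.7],                # group 2
])

posterior = prior.copy()
for obs in [0, 0, 1]:               # the unseen user's interactions so far
    posterior = posterior * likelihood[:, obs]
    posterior /= posterior.sum()    # renormalise after each observation
    print(posterior.round(3))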

@inproceedings{Engonopoulos2018discovering,
title = {Discovering user groups for natural language generation},
author = {Nikos Engonopoulos and Christoph Teichmann and Alexander Koller},
url = {https://arxiv.org/abs/1806.05947},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue},
abstract = {We present a model which predicts how individual users of a dialog system understand and produce utterances based on user groups. In contrast to previous work, these user groups are not specified beforehand, but learned in training. We evaluate on two referring expression (RE) generation tasks; our experiments show that our model can identify user groups and learn how to most effectively talk to them, and can dynamically assign unseen users to the correct groups as they interact with the system.},
pubstate = {published},
type = {inproceedings}
}


Project:   A7

Jágrová, Klára; Avgustinova, Tania; Stenger, Irina; Fischer, Andrea

Language models, surprisal and fantasy in Slavic intercomprehension Journal Article

Computer Speech & Language, 2018.

In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language – a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance.

Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models.

We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that our distinguishing of these two measurable dimensions is helpful in understanding certain unexpected effects in human behaviour.
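
A minimal sketch of the trigram predictability measure: per-word surprisal, -log2 P(w | h1 h2), from add-alpha-smoothed trigram counts. The tiny training corpus and the smoothing choice are assumptions for illustration, not the paper's actual language models.

import math
from collections import Counter

def train_trigram(sentences):
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s + ["</s>"]
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi

def surprisal(w, h1, h2, tri, bi, vocab_size, alpha=0.1):
    """-log2 P(w | h1 h2), add-alpha smoothed."""
    num = tri[(h1, h2, w)] + alpha
    den = bi[(h1, h2)] + alpha * vocab_size
    return -math.log2(num / den)

corpus = [["to", "jest", "dom"], ["to", "jest", "kot"], ["to", "nie", "jest", "dom"]]
tri, bi = train_trigram(corpus)
vocab = {w for s in corpus for w in s} | {"</s>"}
for w in ["dom", "kot"]:
    print(w, round(surprisal(w, "to", "jest", tri, bi, len(vocab)), 2), "bits")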

@article{Jágrová2018b,
title = {Language models, surprisal and fantasy in Slavic intercomprehension},
author = {Kl{\'a}ra J{\'a}grov{\'a} and Tania Avgustinova and Irina Stenger and Andrea Fischer},
url = {https://www.sciencedirect.com/science/article/pii/S0885230817300451},
year = {2018},
date = {2018},
journal = {Computer Speech & Language},
abstract = {In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language – a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance. Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we obtain the predictability of words in a sentence with the help of trigram language models. We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that our distinguishing of these two measurable dimensions is helpful in understanding certain unexpected effects in human behaviour.},
pubstate = {published},
type = {article}
}


Project:   C4

Jágrová, Klára; Stenger, Irina; Avgustinova, Tania

Polski nadal nieskomplikowany? Interkomprehensionsexperimente mit Nominalphrasen Journal Article

Polnisch in Deutschland. Zeitschrift der Bundesvereinigung der Polnischlehrkräfte, 5/2017, pp. 20-37, 2018.

@article{Jágrová2018,
title = {Polski nadal nieskomplikowany? Interkomprehensionsexperimente mit Nominalphrasen},
author = {Kl{\'a}ra J{\'a}grov{\'a} and Irina Stenger and Tania Avgustinova},
year = {2018},
date = {2018},
journal = {Polnisch in Deutschland. Zeitschrift der Bundesvereinigung der Polnischlehrkr{\"a}fte},
pages = {20-37},
volume = {5/2017},
pubstate = {published},
type = {article}
}


Project:   C4

Tourtouri, Elli; Sikos, Les; Crocker, Matthew W.

Referential Entropy influences Overspecification: Evidence from Production Miscellaneous

31st Annual CUNY Sentence Processing Conference, UC Davis, Davis CA, USA, 2018.

Specificity in referential communication

  • Grice’s Maxim of Quantity [1]: Speakers should produce only information that is strictly necessary for identifying the target
  • However, it is possible to establish reference with either minimally-specified (MS; precise) or over-specified (OS; redundant) expressions
  • Moreover, speakers overspecify frequently and systematically [e.g., 2-6]

Q: Why do people overspecify?

@miscellaneous{Tourtourietal2018a,
title = {Referential Entropy influences Overspecification: Evidence from Production},
author = {Elli Tourtouri and Les Sikos and Matthew W. Crocker},
url = {https://www.researchgate.net/publication/323809271_Referential_entropy_influences_overspecification_Evidence_from_production},
year = {2018},
date = {2018},
booktitle = {31st Annual CUNY Sentence Processing Conference},
publisher = {UC Davis},
address = {Davis CA, USA},
abstract = {Specificity in referential communication

  • Grice’s Maxim of Quantity [1]: Speakers should produce only information that is strictly necessary for identifying the target
  • However, it is possible to establish reference with either minimally-specified (MS; precise) or over-specified (OS; redundant) expressions
  • Moreover, speakers overspecify frequently and systematically [e.g., 2-6]
Q: Why do people overspecify?},
pubstate = {published},
type = {miscellaneous}
}


Project:   C3

Karakanta, Alina; Przybyl, Heike; Teich, Elke

Exploring Variation in Translation with Relative Entropy Inproceedings

Lavid-López, Julia; Maíz-Arévalo, Carmen; Zamorano-Mansilla, Juan Rafael (Ed.): Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations, John Benjamins Publishing Company, pp. 307–323, 2018.

While some authors have suggested that translationese fingerprints are universal, others have shown that there is a fair amount of variation among translations due to source language shining through, translation type or translation mode. In our work, we attempt to gain empirical insights into variation in translation, focusing here on translation mode (translation vs. interpreting). Our goal is to discover features of translationese and interpretese that distinguish translated and interpreted output from comparable original text/speech as well as from each other at different linguistic levels. We use relative entropy (Kullback-Leibler Divergence) and visualization with word clouds. Our analysis shows differences in typical words between originals vs. non-originals as well as between translation modes both at lexical and grammatical levels.
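
The word clouds mentioned above can be driven by each word's contribution to D(P || Q), as in the sketch below, which ranks the items that account for most of the divergence; the token lists and smoothing constant are toy assumptions.

import math
from collections import Counter

def kld_contributions(tokens_p, tokens_q, alpha=0.5):
    """Per-word summands of D(P || Q); large positive values mark words
    typical of P relative to Q (candidate word-cloud entries)."""
    vocab = set(tokens_p) | set(tokens_q)
    cp, cq = Counter(tokens_p), Counter(tokens_q)
    np_ = len(tokens_p) + alpha * len(vocab)
    nq = len(tokens_q) + alpha * len(vocab)
    contrib = {
        w: ((cp[w] + alpha) / np_)
        * math.log2(((cp[w] + alpha) / np_) / ((cq[w] + alpha) / nq))
        for w in vocab
    }
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)

original = "we must act now now and decide".split()
interpreted = "so er we er think we must er act".split()
for word, score in kld_contributions(interpreted, original)[:3]:
    print(f"{word:6s} {score:+.3f} bits")   # 'er' should rank high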

@inproceedings{Karakanta2018b,
title = {Exploring Variation in Translation with Relative Entropy},
author = {Alina Karakanta and Heike Przybyl and Elke Teich},
editor = {Julia Lavid-L{\'o}pez and Carmen Ma{\'i}z-Ar{\'e}valo and Juan Rafael Zamorano-Mansilla},
url = {https://benjamins.com/catalog/btl.158.12kar},
doi = {https://doi.org/10.1075/btl.158.12kar},
year = {2018},
date = {2018},
booktitle = {Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations},
pages = {307–323},
publisher = {John Benjamins Publishing Company},
abstract = {While some authors have suggested that translationese fingerprints are universal, others have shown that there is a fair amount of variation among translations due to source language shining through, translation type or translation mode. In our work, we attempt to gain empirical insights into variation in translation, focusing here on translation mode (translation vs. interpreting). Our goal is to discover features of translationese and interpretese that distinguish translated and interpreted output from comparable original text/speech as well as from each other at different linguistic levels. We use relative entropy (Kullback-Leibler Divergence) and visualization with word clouds. Our analysis shows differences in typical words between originals vs. non-originals as well as between translation modes both at lexical and grammatical levels.},
pubstate = {published},
type = {inproceedings}
}


Project:   B7

Karakanta, Alina; Vela, Mihaela; Teich, Elke

EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates Inproceedings

ParlaCLARIN workshop, 11th Language Resources and Evaluation Conference (LREC2018), Miyazaki, Japan, 2018.

Multilingual parliaments have been a useful source for monolingual and multilingual corpus collection. However, extra-textual information about speakers is often absent, and as a result, these resources cannot be fully used in translation studies.

In this paper we present a method for processing and building a parallel corpus consisting of parliamentary debates of the European Parliament for English into German and English into Spanish, where original language and native speaker information is available as metadata. The paper documents all necessary (pre- and post-)processing steps for creating such a valuable resource. In addition to the parallel corpora, we collect monolingual comparable corpora for English, German and Spanish using the same method.

@inproceedings{Karakanta2018a,
title = {EuroParl-UdS: Preserving and Extending Metadata in Parliamentary Debates},
author = {Alina Karakanta and Mihaela Vela and Elke Teich},
url = {http://lrec-conf.org/workshops/lrec2018/W2/pdf/10_W2.pdf},
year = {2018},
date = {2018},
booktitle = {ParlaCLARIN workshop, 11th Language Resources and Evaluation Conference (LREC2018)},
address = {Miyazaki, Japan},
abstract = {Multilingual parliaments have been a useful source for monolingual and multilingual corpus collection. However, extra-textual information about speakers is often absent, and as a result, these resources cannot be fully used in translation studies. In this paper we present a method for processing and building a parallel corpus consisting of parliamentary debates of the European Parliament for English into German and English into Spanish, where original language and native speaker information is available as metadata. The paper documents all necessary (pre- and post-)processing steps for creating such a valuable resource. In addition to the parallel corpora, we collect monolingual comparable corpora for English, German and Spanish using the same method.},
pubstate = {published},
type = {inproceedings}
}


Project:   B7

Collard, Camille; Przybyl, Heike; Defrancq, Bart

Interpreting into an SOV Language: Memory and the Position of the Verb. A Corpus-Based Comparative Study of Interpreted and Non-mediated Speech Journal Article

Kübler, Nathalie; Loock, Rudy; Pecman, Mojca (Ed.): Meta, 63 (3), Les Presses de l’Université de Montréal, pp. 695-716, 2018.

In Dutch and German subordinate clauses, the verb is generally placed after the clausal constituents (Subject-Object-Verb structure) thereby creating a middle field (or verbal brace). This makes interpreting from SOV into SVO languages particularly challenging as it requires further processing and feats of memory. It often requires interpreters to use specific strategies (for example, anticipation) (Lederer 1981; Liontou 2011). However, few studies have tackled this issue from the point of view of interpreting into SOV languages. Producing SOV structures requires some specific cognitive effort as, for instance, subject properties need to be kept in mind in order to ensure the correct subject-verb agreement across a span of 10 or 20 words. Speakers therefore often opt for a strategy called extraposition, placing specific elements after the verb in order to shorten the brace (Hawkins 1994; Bevilacqua 2009). Dutch speakers use this strategy more often than German speakers (Haeseryn 1990). Given the additional cognitive load generated by the interpreting process (Gile 1999), it may be assumed that interpreters will shorten the verbal brace to a larger extent than native speakers.

The present study is based on a corpus of interpreted and non-mediated speeches at the European Parliament and compares middle field lengths as well as extraposition in Dutch and German subordinate clauses. Results from 3460 subordinate clauses confirm that interpreters of both languages shorten the middle field more than native speakers. The study also shows that German interpreters use extraposition more often than native speakers, but this is not the case for Dutch interpreters. Dutch and German interpreters appear to use extraposition partly because they imitate the clause word order of the source speech, showing that, in this case, extraposition can be considered an effort-saving tool.

@article{Collard2018,
title = {Interpreting into an SOV Language: Memory and the Position of the Verb. A Corpus-Based Comparative Study of Interpreted and Non-mediated Speech},
author = {Camille Collard and Heike Przybyl and Bart Defrancq},
editor = {Nathalie K{\"u}bler and Rudy Loock and Mojca Pecman},
url = {https://id.erudit.org/iderudit/1060169ar},
doi = {https://doi.org/10.7202/1060169ar},
year = {2018},
date = {2018},
journal = {Meta},
pages = {695-716},
publisher = {Les Presses de l’Universit{\'e} de Montr{\'e}al},
volume = {63},
number = {3},
abstract = {In Dutch and German subordinate clauses, the verb is generally placed after the clausal constituents (Subject-Object-Verb structure) thereby creating a middle field (or verbal brace). This makes interpreting from SOV into SVO languages particularly challenging as it requires further processing and feats of memory. It often requires interpreters to use specific strategies (for example, anticipation) (Lederer 1981; Liontou 2011). However, few studies have tackled this issue from the point of view of interpreting into SOV languages. Producing SOV structures requires some specific cognitive effort as, for instance, subject properties need to be kept in mind in order to ensure the correct subject-verb agreement across a span of 10 or 20 words. Speakers therefore often opt for a strategy called extraposition, placing specific elements after the verb in order to shorten the brace (Hawkins 1994; Bevilacqua 2009). Dutch speakers use this strategy more often than German speakers (Haeseryn 1990). Given the additional cognitive load generated by the interpreting process (Gile 1999), it may be assumed that interpreters will shorten the verbal brace to a larger extent than native speakers. The present study is based on a corpus of interpreted and non-mediated speeches at the European Parliament and compares middle field lengths as well as extraposition in Dutch and German subordinate clauses. Results from 3460 subordinate clauses confirm that interpreters of both languages shorten the middle field more than native speakers. The study also shows that German interpreters use extraposition more often than native speakers, but this is not the case for Dutch interpreters. Dutch and German interpreters appear to use extraposition partly because they imitate the clause word order of the source speech, showing that, in this case, extraposition can be considered an effort-saving tool.},
pubstate = {published},
type = {article}
}


Project:   B7

Reich, Ingo

Ellipsen Book Chapter

Liedtke, Frank; Tuchen, Astrid (Ed.): Handbuch Pragmatik, J.B. Metzler, pp. 240-251, Stuttgart, 2018, ISBN 978-3-476-04624-6.

The term ›ellipsis‹ is not used uniformly in the literature and, owing to the heterogeneity of the phenomena involved, is not easy to define. As a first approximation, ellipses can be understood as linguistic utterances that are incomplete in a sense still to be made precise, or that are judged incomplete by competent speakers (of German).

@inbook{Reich2018,
title = {Ellipsen},
author = {Ingo Reich},
editor = {Frank Liedtke and Astrid Tuchen},
url = {https://doi.org/10.1007/978-3-476-04624-6_24},
doi = {https://doi.org/10.1007/978-3-476-04624-6_24},
year = {2018},
date = {2018},
booktitle = {Handbuch Pragmatik},
isbn = {978-3-476-04624-6},
pages = {240-251},
publisher = {J.B. Metzler},
address = {Stuttgart},
abstract = {Der Begriff ›Ellipse‹ wird in der Literatur nicht einheitlich verwendet und ist aufgrund der Heterogenit{\"a}t des Ph{\"a}nomenbereichs auch nicht ganz einfach zu definieren. In erster Ann{\"a}herung kann man unter Ellipsen sprachliche {\"A}u{\ss}erungen verstehen, die in einem zu pr{\"a}zisierenden Sinne unvollst{\"a}ndig sind oder von kompetenten Sprecher/innen (des Deutschen) als unvollst{\"a}ndig aufgefasst werden.},
pubstate = {published},
type = {inbook}
}


Project:   B3

Crible, Ludivine; Demberg, Vera

The effect of genre variation on the production and acceptability of underspecified discourse markers in English Inproceedings

20th DiscourseNet, Budapest, Hungary, 2018.

@inproceedings{Crible2018,
title = {The effect of genre variation on the production and acceptability of underspecified discourse markers in English},
author = {Ludivine Crible and Vera Demberg},
url = {https://dial.uclouvain.be/pr/boreal/object/boreal:192393},
year = {2018},
date = {2018},
publisher = {20th DiscourseNet},
address = {Budapest, Hungary},
pubstate = {published},
type = {inproceedings}
}


Project:   B2

Degaetano-Ortlieb, Stefania; Teich, Elke

Using relative entropy for detection and analysis of periods of diachronic linguistic change Inproceedings

Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at COLING2018, Association for Computational Linguistics, pp. 22-33, Santa Fe, New Mexico, 2018.

We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler Divergence) comparing temporally adjacent periods and sliding over the time line from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges.
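
A compact sketch of the sliding comparison, assuming per-period feature counts (the numbers below are toy stand-ins for lexical or part-of-speech frequencies per 50-year slice): compute relative entropy between each period and its predecessor and look for peaks as candidate periods of change.

import numpy as np
from scipy.stats import entropy

vocab = ["phlogiston", "oxygen", "experiment", "hence"]
counts = {  # rows follow vocab order; toy data
    1700: [9, 0, 5, 6],
    1750: [7, 1, 8, 5],
    1800: [1, 10, 9, 2],
    1850: [0, 11, 10, 1],
}

def smoothed(c, alpha=0.5):
    c = np.asarray(c, dtype=float) + alpha  # add-alpha smoothing
    return c / c.sum()

print("features:", ", ".join(vocab))
years = sorted(counts)
for prev, cur in zip(years, years[1:]):
    d = entropy(smoothed(counts[cur]), smoothed(counts[prev]), base=2)
    print(f"{prev}->{cur}: D = {d:.3f} bits")  # expect a peak at 1750->1800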

@inproceedings{Degaetano-Ortlieb2018b,
title = {Using relative entropy for detection and analysis of periods of diachronic linguistic change},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W18-4503},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at COLING2018},
pages = {22-33},
publisher = {Association for Computational Linguistics},
address = {Santa Fe, New Mexico},
abstract = {We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler Divergence) comparing temporally adjacent periods and sliding over the time line from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Teich, Elke; Fankhauser, Peter

Aspects of Linguistic and Computational Modeling in Language Science Book Chapter

Flanders, Julia; Jannidis, Fotis (Ed.): The Shape of Data in Digital Humanities. Modeling Texts and Text-based Resources (Digital Research in the Arts and Humanities), Routledge, Taylor & Francis, pp. 236-249, New York, 2018.

Linguistics is concerned with modeling language from the cognitive, social, and historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in time in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantadosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions such as: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained? This new empiricism is characterized by an interest in language use in context accompanied by a commitment to computational modeling, which is probably most developed in psycholinguistics, giving rise to the field of “computational psycholinguistics” (cf. Crocker, 2010), but recently getting stronger also in corpus linguistics.

@inbook{Teich2018,
title = {Aspects of Linguistic and Computational Modeling in Language Science},
author = {Elke Teich and Peter Fankhauser},
editor = {Julia Flanders and Fotis Jannidis},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/34320},
year = {2018},
date = {2018},
booktitle = {The Shape of Data in Digital Humanities. Modeling Texts and Text-based Resources (Digital Research in the Arts and Humanities)},
pages = {236-249},
publisher = {Routledge, Taylor & Francis},
address = {New York},
abstract = {Linguistics is concerned with modeling language from the cognitive, social, and historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in time in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantadosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions such as: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained? This new empiricism is characterized by an interest in language use in context accompanied by a commitment to computational modeling, which is probably most developed in psycholinguistics, giving rise to the field of “computational psycholinguistics” (cf. Crocker, 2010), but recently getting stronger also in corpus linguistics.},
pubstate = {published},
type = {inbook}
}


Project:   B1
