Publications

Avgustinova, Tania; Stenger, Irina

Russian-Bulgarian mutual intelligibility in light of linguistic and statistical models of Slavic receptive multilingualism [Russko-bolgarskaja vzaimoponjatnost’ v svete lingvističeskich i statističeskich modelej slavjanskoj receptivnoj mnogojazyčnosti] Book Chapter

Marti, Roland; Pognan, Patrice; Schlamberger Brezar, Mojca (Ed.): University Press, Faculty of Arts, pp. 85-99, Ljubljana, Slovenia, 2020.

Computational modelling of the observed mutual intelligibility of Slavic languages unavoidably requires systematic integration of classical Slavistics knowledge from comparative historical grammar and traditional contrastive description of language pairs. The phenomenon of intercomprehension is quite intuitive: speakers of a given language L1 understand another closely related language (variety) L2 without being able to use the latter productively, i.e. for speaking or writing.

This specific mode of using the human linguistic competence manifests itself as receptive multilingualism. The degree of mutual understanding of genetically closely related languages, such as Bulgarian and Russian, corresponds to objectively measurable distances at different linguistic levels. The common Slavic basis and the comparative-synchronous perspective allow us to reveal Bulgarian-Russian linguistic affinity with regard to spelling, vocabulary and grammar.

@inbook{Avgustinova2020,
title = {Russian-Bulgarian mutual intelligibility in light of linguistic and statistical models of Slavic receptive multilingualism [Russko-bolgarskaja vzaimoponjatnost’ v svete lingvisti{\v{c}}eskich i statisti{\v{c}}eskich modelej slavjanskoj receptivnoj mnogojazy{\v{c}}nosti]},
author = {Tania Avgustinova and Irina Stenger},
editor = {Roland Marti and Patrice Pognan and Mojca Schlamberger Brezar},
url = {https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/226/326/5284-1},
year = {2020},
date = {2020},
pages = {85-99},
publisher = {University Press, Faculty of Arts},
address = {Ljubljana, Slovenia},
abstract = {Computational modelling of the observed mutual intelligibility of Slavic languages unavoidably requires systematic integration of classical Slavistics knowledge from comparative historical grammar and traditional contrastive description of language pairs. The phenomenon of intercomprehension is quite intuitive: speakers of a given language L1 understand another closely related language (variety) L2 without being able to use the latter productively, i.e. for speaking or writing. This specific mode of using the human linguistic competence manifests itself as receptive multilingualism. The degree of mutual understanding of genetically closely related languages, such as Bulgarian and Russian, corresponds to objectively measurable distances at different linguistic levels. The common Slavic basis and the comparative-synchronous perspective allow us to reveal Bulgarian-Russian linguistic affinity with regard to spelling, vocabulary and grammar.},
pubstate = {published},
type = {inbook}
}


Project:   C4

Stenger, Irina; Avgustinova, Tania

Visual vs. auditory perception of Bulgarian stimuli by Russian native speakers Inproceedings

Selegej, Vladimir P. et al. (Ed.): Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference ‘Dialogue’, pp. 684-695, 2020.

This study contributes to a better understanding of receptive multilingualism by determining similarities and differences in successful processing of written and spoken cognate words in an unknown but (closely) related language. We investigate two Slavic languages with regard to their mutual intelligibility. The current focus is on the recognition of isolated Bulgarian words by Russian native speakers in a cognate guessing task, considering both written and audio stimuli.

The experimentally obtained intercomprehension scores show a generally high degree of intelligibility of Bulgarian cognates to Russian subjects, as well as processing difficulties in case of visual vs. auditory perception. In search of an explanation, we examine the linguistic factors that can contribute to various degrees of written and spoken word intelligibility. The intercomprehension scores obtained in the online word translation experiments are correlated with (i) the identical and mismatched correspondences on the orthographic and phonetic level, (ii) the word length of the stimuli, and (iii) the frequency of Russian cognates. Additionally we validate two measuring methods: the Levenshtein distance and the word adaptation surprisal as potential predictors of the word intelligibility in reading and oral intercomprehension.
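
The Levenshtein distance named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the normalisation by the longer word, and the example word pair are assumptions chosen purely for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # delete s_char
                               current[j - 1] + 1,       # insert t_char
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def normalized_levenshtein(source: str, target: str) -> float:
    """Distance divided by the longer word length (an assumed normalisation)."""
    longest = max(len(source), len(target))
    return levenshtein(source, target) / longest if longest else 0.0


# Illustrative pair, not taken from the study's stimuli:
# Bulgarian "мляко" and Russian "молоко" (both "milk").
print(levenshtein("мляко", "молоко"))
print(normalized_levenshtein("мляко", "молоко"))
```

A lower normalised distance would, under the hypothesis described in the abstract, be expected to go with higher intelligibility of the cognate.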

@inproceedings{Stenger2020b,
title = {Visual vs. auditory perception of Bulgarian stimuli by Russian native speakers},
author = {Irina Stenger and Tania Avgustinova},
editor = {Vladimir P. Selegej et al.},
url = {http://www.dialog-21.ru/media/4962/stengeriplusavgustinovat-045.pdf},
year = {2020},
date = {2020},
booktitle = {Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference ‘Dialogue’},
pages = {684-695},
abstract = {This study contributes to a better understanding of receptive multilingualism by determining similarities and differences in successful processing of written and spoken cognate words in an unknown but (closely) related language. We investigate two Slavic languages with regard to their mutual intelligibility. The current focus is on the recognition of isolated Bulgarian words by Russian native speakers in a cognate guessing task, considering both written and audio stimuli. The experimentally obtained intercomprehension scores show a generally high degree of intelligibility of Bulgarian cognates to Russian subjects, as well as processing difficulties in case of visual vs. auditory perception. In search of an explanation, we examine the linguistic factors that can contribute to various degrees of written and spoken word intelligibility. The intercomprehension scores obtained in the online word translation experiments are correlated with (i) the identical and mismatched correspondences on the orthographic and phonetic level, (ii) the word length of the stimuli, and (iii) the frequency of Russian cognates. Additionally we validate two measuring methods: the Levenshtein distance and the word adaptation surprisal as potential predictors of the word intelligibility in reading and oral intercomprehension.},
pubstate = {published},
type = {inproceedings}
}


Project:   C4

Avgustinova, Tania; Jágrová, Klára; Stenger, Irina

The INCOMSLAV Platform: Experimental Website with Integrated Methods for Measuring Linguistic Distances and Asymmetries in Receptive Multilingualism Inproceedings

Fiumara, James; Cieri, Christopher; Liberman, Mark; Callison-Burch, Chris (Ed.): LREC 2020 Workshop Language Resources and Evaluation Conference 11-16 May 2020, Citizen Linguistics in Language Resource Development (CLLRD 2020), Peter Lang, pp. 483-500, 2020.

We report on a web-based resource for conducting intercomprehension experiments with native speakers of Slavic languages and present our methods for measuring linguistic distances and asymmetries in receptive multilingualism. Through a website which serves as a platform for online testing, a large number of participants with different linguistic backgrounds can be targeted. A statistical language model is used to measure information density and to gauge how language users master various degrees of (un)intelligibility. The key idea is that intercomprehension should be better when the model adapted for understanding the unknown language exhibits relatively low average distance and surprisal. All obtained intelligibility scores together with distance and asymmetry measures for the different language pairs and processing directions are made available as an integrated online resource in the form of a Slavic intercomprehension matrix (SlavMatrix).

@inproceedings{Avgustinova2020b,
title = {The INCOMSLAV Platform: Experimental Website with Integrated Methods for Measuring Linguistic Distances and Asymmetries in Receptive Multilingualism},
author = {Tania Avgustinova and Kl{\'a}ra J{\'a}grov{\'a} and Irina Stenger},
editor = {James Fiumara and Christopher Cieri and Mark Liberman and Chris Callison-Burch},
url = {https://aclanthology.org/2020.cllrd-1.6/},
doi = {https://doi.org/10.3726/978-3-653-07147-4},
year = {2020},
date = {2020},
booktitle = {LREC 2020 Workshop Language Resources and Evaluation Conference 11-16 May 2020, Citizen Linguistics in Language Resource Development (CLLRD 2020)},
pages = {483-500},
publisher = {Peter Lang},
abstract = {We report on a web-based resource for conducting intercomprehension experiments with native speakers of Slavic languages and present our methods for measuring linguistic distances and asymmetries in receptive multilingualism. Through a website which serves as a platform for online testing, a large number of participants with different linguistic backgrounds can be targeted. A statistical language model is used to measure information density and to gauge how language users master various degrees of (un)intelligibility. The key idea is that intercomprehension should be better when the model adapted for understanding the unknown language exhibits relatively low average distance and surprisal. All obtained intelligibility scores together with distance and asymmetry measures for the different language pairs and processing directions are made available as an integrated online resource in the form of a Slavic intercomprehension matrix (SlavMatrix).},
pubstate = {published},
type = {inproceedings}
}


Project:   C4

Stenger, Irina; Jágrová, Klára; Fischer, Andrea; Avgustinova, Tania

“Reading Polish with Czech Eyes” or “How Russian Can a Bulgarian Text Be?”: Orthographic Differences as an Experimental Variable in Slavic Intercomprehension Incollection

Radeva-Bork, Teodora; Kosta, Peter (Ed.): Current Developments in Slavic Linguistics. Twenty Years After (based on selected papers from FDSL 11), Peter Lang, pp. 483-500, 2020.

@incollection{Stenger2020,
title = {“Reading Polish with Czech Eyes” or “How Russian Can a Bulgarian Text Be?”: Orthographic Differences as an Experimental Variable in Slavic Intercomprehension},
author = {Irina Stenger and Kl{\'a}ra J{\'a}grov{\'a} and Andrea Fischer and Tania Avgustinova},
editor = {Teodora Radeva-Bork and Peter Kosta},
url = {https://www.peterlang.com/view/title/19540},
doi = {https://doi.org/10.3726/978-3-653-07147-4},
year = {2020},
date = {2020},
booktitle = {Current Developments in Slavic Linguistics. Twenty Years After (based on selected papers from FDSL 11)},
pages = {483-500},
publisher = {Peter Lang},
pubstate = {published},
type = {incollection}
}


Project:   C4

Jachmann, Torsten

The immediate influence of speaker gaze on situated speech comprehension: evidence from multiple ERP components PhD Thesis

Saarland University, Saarbruecken, Germany, 2020.

This thesis presents results from three ERP experiments on the influence of speaker gaze on listeners’ sentence comprehension with focus on the utilization of speaker gaze as part of the communicative signal. The first two experiments investigated whether speaker gaze was utilized in situated communication to form expectations about upcoming referents in an unfolding sentence. Participants were presented with a face performing gaze actions toward three objects surrounding it time aligned to utterances that compared two of the three objects.

Participants were asked to judge whether the sentence they heard was true given the provided scene. Gaze cues preceded the naming of the corresponding object by 800 ms. The gaze cue preceding the mentioning of the second object was manipulated such that it was either Congruent, Incongruent or Uninformative (Averted toward an empty position in Experiment 1 and Mutual (redirected toward the listener) in Experiment 2). The results showed that speaker gaze was used to form expectations about the unfolding sentence indicated by three observed ERP components that index different underlying mechanisms of language comprehension: an increased Phonological Mapping Negativity (PMN) was observed when an unexpected (Incongruent) or unpredictable (Uninformative) phoneme was encountered. The retrieval of a referent’s semantics was indexed by an N400 effect in response to referents following both Incongruent and Uninformative gaze. Additionally, an increased P600 response was present only for preceding Incongruent gaze, indexing the revision process of the mental representation of the situation. The involvement of these mechanisms has been supported by the findings of the third experiment, in which linguistic content was presented to serve as a predictive cue for subsequent speaker gaze. In this experiment the sentence structure enabled participants to anticipate upcoming referents based on the preceding linguistic content. Thus, gaze cues preceding the mentioning of the referent could also be anticipated.

The results showed the involvement of the same mechanisms as in the first two experiments on the referent itself, only when preceding gaze was absent. In the presence of object-directed gaze, while there were no longer significant effects on the referent itself, effects of semantic retrieval (N400) and integration with sentence meaning (P3b) were found on the gaze cue. Effects in the P3b (Gaze) and P600 (Referent) time-window further provided support for the presence of a mechanism of monitoring of the mental representation of the situation that subsumes the integration into that representation: A positive deflection was found whenever the communicative signal completed the mental representation such that an evaluation of that representation was possible. Taken together, the results provide support for the view that speaker gaze, in situated communication, is interpreted as part of the communicative signal and incrementally used to inform the mental representation of the situation simultaneously with the linguistic signal and that the mental representation is utilized to generate expectations about upcoming referents in an unfolding utterance.

@phdthesis{Jachmann2020,
title = {The immediate influence of speaker gaze on situated speech comprehension: evidence from multiple ERP components},
author = {Torsten Jachmann},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-313090},
doi = {https://doi.org/10.22028/D291-31309},
year = {2020},
date = {2020},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {This thesis presents results from three ERP experiments on the influence of speaker gaze on listeners’ sentence comprehension with focus on the utilization of speaker gaze as part of the communicative signal. The first two experiments investigated whether speaker gaze was utilized in situated communication to form expectations about upcoming referents in an unfolding sentence. Participants were presented with a face performing gaze actions toward three objects surrounding it time aligned to utterances that compared two of the three objects. Participants were asked to judge whether the sentence they heard was true given the provided scene. Gaze cues preceded the naming of the corresponding object by 800ms. The gaze cue preceding the mentioning of the second object was manipulated such that it was either Congruent, Incongruent or Uninformative (Averted toward an empty position in experiment 1 and Mutual (redirected toward the listener) in Experiment 2). The results showed that speaker gaze was used to form expectations about the unfolding sentence indicated by three observed ERP components that index different underlying mechanisms of language comprehension: an increased Phonological Mapping Negativity (PMN) was observed when an unexpected (Incongruent) or unpredictable (Uninformative) phoneme is encountered. The retrieval of a referent’s semantics was indexed by an N400 effect in response to referents following both Incongruent and Uninformative gaze. Additionally, an increased P600 response was present only for preceding Incongruent gaze, indexing the revision process of the mental representation of the situation. The involvement of these mechanisms has been supported by the findings of the third experiment, in which linguistic content was presented to serve as a predictive cue for subsequent speaker gaze. In this experiment the sentence structure enabled participants to anticipate upcoming referents based on the preceding linguistic content. 
Thus, gaze cues preceding the mentioning of the referent could also be anticipated. The results showed the involvement of the same mechanisms as in the first two experiments on the referent itself, only when preceding gaze was absent. In the presence of object-directed gaze, while there were no longer significant effects on the referent itself, effects of semantic retrieval (N400) and integration with sentence meaning (P3b) were found on the gaze cue. Effects in the P3b (Gaze) and P600 (Referent) time-window further provided support for the presence of a mechanism of monitoring of the mental representation of the situation that subsumes the integration into that representation: A positive deflection was found whenever the communicative signal completed the mental representation such that an evaluation of that representation was possible. Taken together, the results provide support for the view that speaker gaze, in situated communication, is interpreted as part of the communicative signal and incrementally used to inform the mental representation of the situation simultaneously with the linguistic signal and that the mental representation is utilized to generate expectations about upcoming referents in an unfolding utterance.},
pubstate = {published},
type = {phdthesis}
}


Project:   C3

Meier, David; Andreeva, Bistra

Einflussfaktoren auf die Wahrnehmung von Prominenz im natürlichen Dialog Inproceedings

Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz, pp. 257-264, Magdeburg, 2020.

Turnbull et al. [1] find that several competing factors affect the perception of prosodic prominence in isolated adjective-noun pairs, namely phonology, discourse context and knowledge about the discourse. This paper aims to investigate the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting that is more comparable to a natural-language dialogue. For the study, 144 realised sentences from a single male speaker were spliced together so that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, which is in line with the findings of Turnbull et al. In addition, adjectives are consistently rated as more prominent than nouns in comparable contexts. However, extending the discourse context and the background information available to the participants has only negligible effects in the experimental setup presented here.

@inproceedings{Meier2020,
title = {Einflussfaktoren auf die Wahrnehmung von Prominenz im nat{\"u}rlichen Dialog},
author = {David Meier and Bistra Andreeva},
url = {https://www.essv.de/paper.php?id=465},
year = {2020},
date = {2020},
booktitle = {Elektronische Sprachsignalverarbeitung 2020, Tagungsband der 31. Konferenz},
pages = {257-264},
address = {Magdeburg},
abstract = {Turnbull et al. [1] find that several competing factors affect the perception of prosodic prominence in isolated adjective-noun pairs, namely phonology, discourse context and knowledge about the discourse. This paper aims to investigate the relative influence of evoked focus (narrow contrastive vs. broad contrastive) and accentuation (accented vs. unaccented) on the perception of prominence, and to test whether the concepts presented in Turnbull et al. can be reproduced in a setting that is more comparable to a natural-language dialogue. For the study, 144 realised sentences from a single male speaker were spliced together so that a semantic contrast arises either on the noun in question or on the adjective. The metrically strong syllables of the adjective or the noun were accented either in accordance with the focus structure or contrary to expectation. The results show that accentuation has a greater influence on prominence perception than the focus condition, which is in line with the findings of Turnbull et al. In addition, adjectives are consistently rated as more prominent than nouns in comparable contexts. However, extending the discourse context and the background information available to the participants has only negligible effects in the experimental setup presented here.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Andreeva, Bistra; Möbius, Bernd; Whang, James

Effects of surprisal and boundary strength on phrase-final lengthening Inproceedings

Proc. 10th International Conference on Speech Prosody 2020, pp. 146-150, 2020.

This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength.

ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.
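
As a rough illustration of the surprisal measure mentioned above, the following Python sketch estimates the surprisal of a syllable given its two preceding syllables from raw trigram counts. It makes strong simplifying assumptions: the toy syllable sequences, the function names and the unsmoothed maximum-likelihood estimate are all hypothetical, and the study itself relied on language models trained on the DeWaC corpus whose details are not given here.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count syllable trigrams and their two-syllable contexts."""
    trigrams, contexts = Counter(), Counter()
    for syllables in corpus:
        for a, b, c in zip(syllables, syllables[1:], syllables[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return trigrams, contexts


def surprisal(context, syllable, trigrams, contexts):
    """Surprisal in bits: -log2 P(syllable | two preceding syllables).

    Unsmoothed estimate: an unseen context or trigram raises an error,
    which a real language model would avoid through smoothing.
    """
    probability = trigrams[(*context, syllable)] / contexts[context]
    return -math.log2(probability)


# Hypothetical toy data: three syllabified sequences.
toy_corpus = [["ta", "ges", "schau"], ["ta", "ges", "zeit"], ["ta", "ges", "schau"]]
trigrams, contexts = train_counts(toy_corpus)

print(surprisal(("ta", "ges"), "schau", trigrams, contexts))  # frequent continuation, lower surprisal
print(surprisal(("ta", "ges"), "zeit", trigrams, contexts))   # rarer continuation, higher surprisal
```

In this toy setting the rarer continuation receives the higher surprisal value, which is the quantity the abstract relates to phrase-final syllable duration.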

@inproceedings{Andreeva2020,
title = {Effects of surprisal and boundary strength on phrase-final lengthening},
author = {Bistra Andreeva and Bernd M{\"o}bius and James Whang},
url = {http://dx.doi.org/10.21437/SpeechProsody.2020-30},
year = {2020},
date = {2020-10-20},
booktitle = {Proc. 10th International Conference on Speech Prosody 2020},
pages = {146-150},
abstract = {This study examines the influence of prosodic structure (pitch accents and boundary strength) and information density (ID) on phrase-final syllable duration. Phrase-final syllable durations and following pause durations were measured in a subset of a German radio-news corpus (DIRNDL), consisting of about 5 hours of manually annotated speech. The prosodic annotation is in accordance with the autosegmental intonation model and includes labels for pitch accents and boundary tones. We treated pause duration as a quantitative proxy for boundary strength. ID was calculated as the surprisal of the syllable trigram of the preceding context, based on language models trained on the DeWaC corpus. We found a significant positive correlation between surprisal and phrase-final syllable duration. Syllable duration was statistically modeled as a function of prosodic factors (pitch accent and boundary strength) and surprisal in linear mixed effects models. The results revealed an interaction of surprisal and boundary strength with respect to phrase-final syllable duration. Syllables with high surprisal values are longer before stronger boundaries, whereas low-surprisal syllables are longer before weaker boundaries. This modulation of pre-boundary syllable duration is observed above and beyond the well-established phrase-final lengthening effect.},
pubstate = {published},
type = {inproceedings}
}


Project:   C1

Teich, Elke; Martínez Martínez, José; Karakanta, Alina

Translation, information theory and cognition Book Chapter

Alves, Fabio; Lykke Jakobsen, Arnt (Ed.): The Routledge Handbook of Translation and Cognition, Routledge, pp. 360-375, London, UK, 2020, ISBN 9781138037007.

The chapter sketches a formal basis for the probabilistic modelling of human translation on the basis of information theory. We provide a definition of Shannon information applied to linguistic communication and discuss its relevance for modelling translation. We further explain the concept of the noisy channel and provide the link to modelling human translational choice. We suggest that a number of translation-relevant variables, notably (dis)similarity between languages, level of expertise and translation mode (i.e., interpreting vs. translation), may be appropriately indexed by entropy, which in turn has been shown to indicate production effort.

@inbook{Teich-etal2020-handbook,
title = {Translation, information theory and cognition},
author = {Elke Teich and Jos{\'e} Mart{\'i}nez Mart{\'i}nez and Alina Karakanta},
editor = {Fabio Alves and Arnt Lykke Jakobsen},
url = {https://www.taylorfrancis.com/chapters/edit/10.4324/9781315178127-24/translation-information-theory-cognition-elke-teich-josé-martínez-martínez-alina-karakanta},
year = {2020},
date = {2020},
booktitle = {The Routledge Handbook of Translation and Cognition},
isbn = {9781138037007},
pages = {360-375},
publisher = {Routledge},
address = {London, UK},
abstract = {The chapter sketches a formal basis for the probabilistic modelling of human translation on the basis of information theory. We provide a definition of Shannon information applied to linguistic communication and discuss its relevance for modelling translation. We further explain the concept of the noisy channel and provide the link to modelling human translational choice. We suggest that a number of translation-relevant variables, notably (dis)similarity between languages, level of expertise and translation mode (i.e., interpreting vs. translation), may be appropriately indexed by entropy, which in turn has been shown to indicate production effort.},
pubstate = {published},
type = {inbook}
}


Project:   B7

Bizzoni, Yuri; Juzek, Tom; España-Bonet, Cristina; Dutta Chowdhury, Koel; van Genabith, Josef; Teich, Elke

How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech Inproceedings

The 17th International Workshop on Spoken Language Translation, Seattle, WA, United States, 2020.

Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).

@inproceedings{Bizzoni2020,
title = {How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech},
author = {Yuri Bizzoni and Tom Juzek and Cristina Espa{\~n}a-Bonet and Koel Dutta Chowdhury and Josef van Genabith and Elke Teich},
url = {https://aclanthology.org/2020.iwslt-1.34/},
doi = {https://doi.org/10.18653/v1/2020.iwslt-1.34},
year = {2020},
date = {2020},
booktitle = {The 17th International Workshop on Spoken Language Translation},
address = {Seattle, WA, United States},
abstract = {Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).},
pubstate = {published},
type = {inproceedings}
}


Projects:   B6 B7

Adelani, David; Hedderich, Michael; Zhu, Dawei; van Berg, Esther; Klakow, Dietrich

Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá Miscellaneous

arXiv:2003.08370, 2020.

The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-)automatic way.

Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios.

In this work, we perform named entity recognition for Hausa and Yorùbá, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier’s performance.

@misc{Adelani2020,
title = {Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yor{\`u}b{\'a}},
author = {David Adelani and Michael Hedderich and Dawei Zhu and Esther van Berg and Dietrich Klakow},
url = {https://arxiv.org/abs/2003.08370},
year = {2020},
date = {2020},
abstract = {The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-)automatic way. Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios. In this work, we perform named entity recognition for Hausa and Yor{\`u}b{\'a}, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier's performance.},
pubstate = {published},
type = {miscellaneous}
}


Project:   B4

Lemke, Tyll Robin; Schäfer, Lisa; Drenhaus, Heiner; Reich, Ingo

Script Knowledge Constrains Ellipses in Fragments - Evidence from Production Data and Language Modeling Inproceedings

Proceedings of the Society for Computation in Linguistics, 3, 2020.

We investigate the effect of script-based (Schank and Abelson 1977) extralinguistic context on the omission of words in fragments. Our data elicited with a production task show that predictable words are more often omitted than unpredictable ones, as predicted by the Uniform Information Density (UID) hypothesis (Levy and Jaeger, 2007).

We take into account effects of linguistic and extralinguistic context on predictability and propose a method for estimating the surprisal of words in presence of ellipsis. Our study extends previous evidence for UID in two ways: First, we show that not only local linguistic context, but also extralinguistic context determines the likelihood of omissions. Second, we find UID effects on the omission of content words.

@inproceedings{Lemke2020,
title = {Script Knowledge Constrains Ellipses in Fragments - Evidence from Production Data and Language Modeling},
author = {Tyll Robin Lemke and Lisa Sch{\"a}fer and Heiner Drenhaus and Ingo Reich},
url = {https://scholarworks.umass.edu/scil/vol3/iss1/45},
doi = {https://doi.org/10.7275/mpby-zr74},
year = {2020},
date = {2020},
booktitle = {Proceedings of the Society for Computation in Linguistics},
abstract = {We investigate the effect of script-based (Schank and Abelson 1977) extralinguistic context on the omission of words in fragments. Our data elicited with a production task show that predictable words are more often omitted than unpredictable ones, as predicted by the Uniform Information Density (UID) hypothesis (Levy and Jaeger, 2007). We take into account effects of linguistic and extralinguistic context on predictability and propose a method for estimating the surprisal of words in presence of ellipsis. Our study extends previous evidence for UID in two ways: First, we show that not only local linguistic context, but also extralinguistic context determines the likelihood of omissions. Second, we find UID effects on the omission of content words.},
pubstate = {published},
type = {inproceedings}
}


Project:   B3

Fischer, Stefan; Knappen, Jörg; Menzel, Katrin; Teich, Elke

The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study Inproceedings

Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, pp. 794-802, Marseille, France, 2020.

We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings.

The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases.

We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.

@inproceedings{fischer-EtAl:2020:LREC,
title = {The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
author = {Stefan Fischer and J{\"o}rg Knappen and Katrin Menzel and Elke Teich},
url = {https://www.aclweb.org/anthology/2020.lrec-1.99/},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
pages = {794-802},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings. The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases. We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Fankhauser, Peter; Teich, Elke

Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach Journal Article

Jurgens, David (Ed.): Frontiers in Artificial Intelligence, section Language and Computation, 2020.

We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665.

Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register).

Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.

@article{Bizzoni2020b,
title = {Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Peter Fankhauser and Elke Teich},
editor = {David Jurgens},
url = {https://www.frontiersin.org/articles/10.3389/frai.2020.00073/full},
doi = {https://doi.org/10.3389/frai.2020.00073},
year = {2020},
date = {2020-10-18},
journal = {Frontiers in Artificial Intelligence, section Language and Computation},
abstract = {We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.},
pubstate = {published},
type = {article}
}


Project:   B1

Wichlacz, Julia; Höller, Daniel; Torralba, Álvaro; Hoffmann, Jörg

Applying Monte-Carlo Tree Search in HTN Planning Inproceedings

Proceedings of the 13th International Symposium on Combinatorial Search (SoCS), AAAI Press, pp. 82-90, Vienna, Austria, 2020.

Search methods are useful in hierarchical task network (HTN) planning to make performance less dependent on the domain knowledge provided, and to minimize plan costs. Here we investigate Monte-Carlo tree search (MCTS) as a new algorithmic alternative in HTN planning. We implement combinations of MCTS with heuristic search in Panda. We furthermore investigate MCTS in JSHOP, to address lifted (non-grounded) planning, leveraging the fact that, in contrast to other search methods, MCTS does not require a grounded task representation. Our new methods yield coverage performance on par with the state of the art, but in addition can effectively minimize plan cost over time.

@inproceedings{Wichlacz20MCTSSOCS,
title = {Applying Monte-Carlo Tree Search in HTN Planning},
author = {Julia Wichlacz and Daniel H{\"o}ller and {\'A}lvaro Torralba and J{\"o}rg Hoffmann},
url = {https://ojs.aaai.org/index.php/SOCS/article/view/18538},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 13th International Symposium on Combinatorial Search (SoCS)},
pages = {82-90},
publisher = {AAAI Press},
address = {Vienna, Austria},
abstract = {Search methods are useful in hierarchical task network (HTN) planning to make performance less dependent on the domain knowledge provided, and to minimize plan costs. Here we investigate Monte-Carlo tree search (MCTS) as a new algorithmic alternative in HTN planning. We implement combinations of MCTS with heuristic search in Panda. We furthermore investigate MCTS in JSHOP, to address lifted (non-grounded) planning, leveraging the fact that, in contrast to other search methods, MCTS does not require a grounded task representation. Our new methods yield coverage performance on par with the state of the art, but in addition can effectively minimize plan cost over time.},
pubstate = {published},
type = {inproceedings}
}


Project:   A7

Höller, Daniel; Bercher, Pascal; Behnke, Gregor

Delete- and Ordering-Relaxation Heuristics for HTN Planning Inproceedings

Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), IJCAI organization, pp. 4076-4083, Yokohama, Japan, 2020.

In HTN planning, the hierarchy has a wide impact on solutions. First, there is (usually) no state-based goal given, the objective is given via the hierarchy. Second, it enforces actions to be in a plan. Third, planners are not allowed to add actions apart from those introduced via decomposition, i.e. via the hierarchy. However, no heuristic considers the interplay of hierarchy and actions in the plan exactly (without relaxation) because this makes heuristic calculation NP-hard even under delete relaxation. We introduce the problem class of delete- and ordering-free HTN planning as basis for novel HTN heuristics and show that its plan existence problem is still NP-complete. We then introduce heuristics based on the new class using an integer programming model to solve it.

@inproceedings{Hoeller2020IJCAI,
title = {Delete- and Ordering-Relaxation Heuristics for HTN Planning},
author = {Daniel H{\"o}ller and Pascal Bercher and Gregor Behnke},
url = {https://www.ijcai.org/proceedings/2020/564},
doi = {https://doi.org/10.24963/ijcai.2020/564},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI)},
pages = {4076-4083},
publisher = {IJCAI organization},
address = {Yokohama, Japan},
abstract = {In HTN planning, the hierarchy has a wide impact on solutions. First, there is (usually) no state-based goal given, the objective is given via the hierarchy. Second, it enforces actions to be in a plan. Third, planners are not allowed to add actions apart from those introduced via decomposition, i.e. via the hierarchy. However, no heuristic considers the interplay of hierarchy and actions in the plan exactly (without relaxation) because this makes heuristic calculation NP-hard even under delete relaxation. We introduce the problem class of delete- and ordering-free HTN planning as basis for novel HTN heuristics and show that its plan existence problem is still NP-complete. We then introduce heuristics based on the new class using an integer programming model to solve it.},
pubstate = {published},
type = {inproceedings}
}


Project:   A7

Ryzhova, Margarita; Demberg, Vera

Processing particularized pragmatic inferences under load Inproceedings

Proceedings of the 42nd Annual Meeting of the Cognitive Science Society (CogSci 2020), 2020.

A long-standing question in language understanding is whether pragmatic inferences are effortful or whether they happen seamlessly without measurable cognitive effort. We here measure the strength of particularized pragmatic inferences in a setting with high vs. low cognitive load. Cognitive load is induced by a secondary dot tracking task.

If this type of pragmatic inference comes at no cognitive processing cost, inferences should be similarly strong in both the high and the low load condition. If they are effortful, we expect a smaller effect size in the dual tasking condition. Our results show that participants who have difficulty in dual tasking (as evidenced by incorrect answers to comprehension questions) exhibit a smaller pragmatic effect when they were distracted with a secondary task in comparison to the single task condition. This finding supports the idea that pragmatic inferences are effortful.

@inproceedings{Ryzhova2020,
title = {Processing particularized pragmatic inferences under load},
author = {Margarita Ryzhova and Vera Demberg},
url = {https://www.semanticscholar.org/paper/Processing-particularized-pragmatic-inferences-load-Ryzhova-Demberg/a5b8d4c72590eaaf965d91d8fafa2495f680313d},
year = {2020},
date = {2020-10-17},
booktitle = {Proceedings of the 42nd Annual Meeting of the Cognitive Science Society (CogSci 2020)},
abstract = {A long-standing question in language understanding is whether pragmatic inferences are effortful or whether they happen seamlessly without measurable cognitive effort. We here measure the strength of particularized pragmatic inferences in a setting with high vs. low cognitive load. Cognitive load is induced by a secondary dot tracking task. If this type of pragmatic inference comes at no cognitive processing cost, inferences should be similarly strong in both the high and the low load condition. If they are effortful, we expect a smaller effect size in the dual tasking condition. Our results show that participants who have difficulty in dual tasking (as evidenced by incorrect answers to comprehension questions) exhibit a smaller pragmatic effect when they were distracted with a secondary task in comparison to the single task condition. This finding supports the idea that pragmatic inferences are effortful.},
pubstate = {published},
type = {inproceedings}
}


Project:   A3

Scholman, Merel; Demberg, Vera; Sanders, Ted J. M.

Individual differences in expecting coherence relations: Exploring the variability in sensitivity to contextual signals in discourse Journal Article

Discourse Processes, 57, pp. 844-861, 2020.

The current study investigated how a contextual list signal influences comprehenders’ inference generation of upcoming discourse relations and whether individual differences in working memory capacity and linguistic experience influence the generation of these inferences. Participants were asked to complete two-sentence stories, the first sentence of which contained an expression of quantity (a few, multiple). Several individual-difference measures were calculated to explore whether individual characteristics can explain the sensitivity to the contextual list signal. The results revealed that participants were sensitive to a contextual list signal (i.e., they provided list continuations), and this sensitivity was modulated by the participants’ linguistic experience, as measured by an author recognition test. The results showed no evidence that working memory affected participants’ responses. These results extend prior research by showing that contextual signals influence participants’ coherence-relation-inference generation. Further, the results of the current study emphasize the importance of individual reader characteristics when it comes to coherence-relation inferences.

@article{Scholman2020,
title = {Individual differences in expecting coherence relations: Exploring the variability in sensitivity to contextual signals in discourse},
author = {Merel Scholman and Vera Demberg and Ted J. M. Sanders},
url = {https://www.tandfonline.com/doi/full/10.1080/0163853X.2020.1813492},
doi = {https://doi.org/10.1080/0163853X.2020.1813492},
year = {2020},
date = {2020-10-02},
journal = {Discourse Processes},
pages = {844-861},
volume = {57},
number = {10},
abstract = {The current study investigated how a contextual list signal influences comprehenders’ inference generation of upcoming discourse relations and whether individual differences in working memory capacity and linguistic experience influence the generation of these inferences. Participants were asked to complete two-sentence stories, the first sentence of which contained an expression of quantity (a few, multiple). Several individual-difference measures were calculated to explore whether individual characteristics can explain the sensitivity to the contextual list signal. The results revealed that participants were sensitive to a contextual list signal (i.e., they provided list continuations), and this sensitivity was modulated by the participants’ linguistic experience, as measured by an author recognition test. The results showed no evidence that working memory affected participants’ responses. These results extend prior research by showing that contextual signals influence participants’ coherence-relation-inference generation. Further, the results of the current study emphasize the importance of individual reader characteristics when it comes to coherence-relation inferences.},
pubstate = {published},
type = {article}
}


Project:   B2

Brouwer, Harm; Delogu, Francesca; Crocker, Matthew W.

Splitting event‐related potentials: Modeling latent components using regression‐based waveform estimation Journal Article

European Journal of Neuroscience, 2020.

Event‐related potentials (ERPs) provide a multidimensional and real‐time window into neurocognitive processing. The typical Waveform‐based Component Structure (WCS) approach to ERPs assesses the modulation pattern of components—systematic, reoccurring voltage fluctuations reflecting specific computational operations—by looking at mean amplitude in predetermined time‐windows.

This WCS approach, however, often leads to inconsistent results within as well as across studies. It has been argued that at least some inconsistencies may be reconciled by considering spatiotemporal overlap between components; that is, components may overlap in both space and time, and given their additive nature, this means that the WCS may fail to accurately represent its underlying latent component structure (LCS). We employ regression‐based ERP (rERP) estimation to extend traditional approaches with an additional layer of analysis, which enables the explicit modeling of the LCS underlying WCS. To demonstrate its utility, we incrementally derive an rERP analysis of a recent study on language comprehension with seemingly inconsistent WCS‐derived results.

Analysis of the resultant regression models allows one to derive an explanation for the WCS in terms of how relevant regression predictors combine in space and time, and crucially, how individual predictors may be mapped onto unique components in LCS, revealing how these spatiotemporally overlap in the WCS. We conclude that rERP estimation allows for investigating how scalp‐recorded voltages derive from the spatiotemporal combination of experimentally manipulated factors. Moreover, when factors can be uniquely mapped onto components, rERPs may offer explanations for seemingly inconsistent ERP waveforms at the level of their underlying latent component structure.

@article{Brouwer2020,
title = {Splitting event‐related potentials: Modeling latent components using regression‐based waveform estimation},
author = {Harm Brouwer and Francesca Delogu and Matthew W. Crocker},
url = {https://onlinelibrary.wiley.com/doi/10.1111/ejn.14961},
doi = {https://doi.org/10.1111/ejn.14961},
year = {2020},
date = {2020-09-08},
journal = {European Journal of Neuroscience},
abstract = {Event‐related potentials (ERPs) provide a multidimensional and real‐time window into neurocognitive processing. The typical Waveform‐based Component Structure (WCS) approach to ERPs assesses the modulation pattern of components—systematic, reoccurring voltage fluctuations reflecting specific computational operations—by looking at mean amplitude in predetermined time‐windows. This WCS approach, however, often leads to inconsistent results within as well as across studies. It has been argued that at least some inconsistencies may be reconciled by considering spatiotemporal overlap between components; that is, components may overlap in both space and time, and given their additive nature, this means that the WCS may fail to accurately represent its underlying latent component structure (LCS). We employ regression‐based ERP (rERP) estimation to extend traditional approaches with an additional layer of analysis, which enables the explicit modeling of the LCS underlying WCS. To demonstrate its utility, we incrementally derive an rERP analysis of a recent study on language comprehension with seemingly inconsistent WCS‐derived results. Analysis of the resultant regression models allows one to derive an explanation for the WCS in terms of how relevant regression predictors combine in space and time, and crucially, how individual predictors may be mapped onto unique components in LCS, revealing how these spatiotemporally overlap in the WCS. We conclude that rERP estimation allows for investigating how scalp‐recorded voltages derive from the spatiotemporal combination of experimentally manipulated factors. Moreover, when factors can be uniquely mapped onto components, rERPs may offer explanations for seemingly inconsistent ERP waveforms at the level of their underlying latent component structure.},
pubstate = {published},
type = {article}
}


Project:   A1

Dutta Chowdhury, Koel; España-Bonet, Cristina; van Genabith, Josef

Understanding Translationese in Multi-view Embedding Spaces Inproceedings

Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, pp. 6056-6062, Barcelona, Catalonia (Online), 2020.

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data (words, parts of speech, semantic tags and synsets) to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.

@inproceedings{DuttaEtal:COLING:2020,
title = {Understanding Translationese in Multi-view Embedding Spaces},
author = {Koel Dutta Chowdhury and Cristina Espa{\~n}a-Bonet and Josef van Genabith},
url = {https://www.aclweb.org/anthology/2020.coling-main.532/},
doi = {https://doi.org/10.18653/v1/2020.coling-main.532},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
pages = {6056-6062},
publisher = {International Committee on Computational Linguistics},
address = {Barcelona, Catalonia (Online)},
abstract = {Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data {---} words, parts of speech, semantic tags and synsets {---} to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.},
pubstate = {published},
type = {inproceedings}
}


Project:   B6

Hedderich, Michael; Adelani, David; Zhu, Dawei; Jesujoba, Alabi; Udia, Markus; Klakow, Dietrich

Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages Inproceedings

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 2580-2591, 2020.

Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.

@inproceedings{hedderich-etal-2020-transfer,
title = {Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages},
author = {Michael Hedderich and David Adelani and Dawei Zhu and Alabi Jesujoba and Markus Udia and Dietrich Klakow},
url = {https://www.aclweb.org/anthology/2020.emnlp-main.204},
doi = {https://doi.org/10.18653/v1/2020.emnlp-main.204},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages = {2580-2591},
publisher = {Association for Computational Linguistics},
abstract = {Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùb{\'a} on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.},
pubstate = {published},
type = {inproceedings}
}


Project:   B4
