Publications

Greenberg, Clayton

Evaluating humanness in language models PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

Advances with language models, systems that predict upcoming words in context, have enabled an era in which people sometimes cannot distinguish between human-written and artificially created text. Perplexity, the simplest and most popular way to evaluate the quality of a language model, rewards any pattern captured by the system as long as it robustly constrains the upcoming possibilities. By capturing patterns that humans do not use, optimizing a language model for minimal perplexity could trigger a divergence between the most probable text and the most human-like text. In this thesis, I argue that this divergence has happened for state-of-the-art language models. Part I characterizes the kinds of knowledge captured by language models. First, I present three novel language model architectures whose neural connections were inspired by human behavior. Then, I discuss novel morphology- and sentiment-based paradigms that capture human knowledge quantitatively. Part II establishes several methods for evaluating language models by comparison against human behavior measures. I consider the suitability and potential confounds for offline ratings and two paradigms of online reading times: eye-tracking and G-Maze. Then, I use a novel dataset of G-Maze response times to show computational and linguistic evidence of the divergence.


Fortschritte bei Sprachmodellen (LMs) – Systeme, die aus dem Kontext heraus nachfolgende Worte vorhersagen – haben dazu geführt, dass Menschen manchmal nicht mehr zwischen von Menschen geschriebenem und künstlich erzeugtem Text unterscheiden können. Perplexität (PPL), die einfachste und beliebteste Methode zur Bewertung der Qualität eines LM, belohnt jedes vom System erfasste Muster, solange es die kommenden Möglichkeiten stark einschränkt. Durch die Erfassung von Mustern, die Menschen nicht verwenden, könnte die Optimierung eines LM hinsichtlich minimaler PPL zu einer Divergenz zwischen dem wahrscheinlichsten Text und dem menschenähnlichsten Text führen. In dieser Arbeit wird argumentiert, dass diese Divergenz bei modernen LMs aufgetreten ist. Teil I charakterisiert die Arten von Wissen, die von LMs erfasst werden. Zuerst werden drei neue LM-Architekturen beschreiben, deren neuronale Verbindungen von menschlichem Verhalten inspiriert wurden. Danach werden neuartige morphologie- und sentiment-basierte Paradigmen diskutiert, die menschliches Verhalten quantitativ erfassen. In Teil II werden mehrere Methoden entwickelt, die LMs durch Vergleich mit menschlichen Verhaltensmaßen bewerten. Diskutiert werden die Eignung und mögliche Störfaktoren für Offline-Bewertungen und zwei Paradigmen von Online-Lesezeiten: Eye-Tracking und G-Maze. Ein neuartiger Datensatz der G-Maze-Antwortzeiten wird dazu verwendet, um rechnerische und sprachliche Beweise für die Divergenz zu liefern.

@phdthesis{Greenberg_Diss,
title = {Evaluating humanness in language models},
author = {Clayton Greenberg},
url = {https://jahrbib.sulb.uni-saarland.de/handle/20.500.11880/37534},
doi = {10.22028/D291-41943},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {Advances with language models, systems that predict upcoming words in context, have enabled an era in which people sometimes cannot distinguish between human-written and artificially created text. Perplexity, the simplest and most popular way to evaluate the quality of a language model, rewards any pattern captured by the system as long as it robustly constrains the upcoming possibilities. By capturing patterns that humans do not use, optimizing a language model for minimal perplexity could trigger a divergence between the most probable text and the most human-like text. In this thesis, I argue that this divergence has happened for state-of-the-art language models. Part I characterizes the kinds of knowledge captured by language models. First, I present three novel language model architectures whose neural connections were inspired by human behavior. Then, I discuss novel morphology- and sentiment-based paradigms that capture human knowledge quantitatively. Part II establishes several methods for evaluating language models by comparison against human behavior measures. I consider the suitability and potential confounds for offline ratings and two paradigms of online reading times: eye-tracking and G-Maze. Then, I use a novel dataset of G-Maze response times to show computational and linguistic evidence of the divergence.


Fortschritte bei Sprachmodellen (LMs) - Systeme, die aus dem Kontext heraus nachfolgende Worte vorhersagen - haben dazu gef{\"u}hrt, dass Menschen manchmal nicht mehr zwischen von Menschen geschriebenem und k{\"u}nstlich erzeugtem Text unterscheiden k{\"o}nnen. Perplexit{\"a}t (PPL), die einfachste und beliebteste Methode zur Bewertung der Qualit{\"a}t eines LM, belohnt jedes vom System erfasste Muster, solange es die kommenden M{\"o}glichkeiten stark einschr{\"a}nkt. Durch die Erfassung von Mustern, die Menschen nicht verwenden, k{\"o}nnte die Optimierung eines LM hinsichtlich minimaler PPL zu einer Divergenz zwischen dem wahrscheinlichsten Text und dem menschen{\"a}hnlichsten Text f{\"u}hren. In dieser Arbeit wird argumentiert, dass diese Divergenz bei modernen LMs aufgetreten ist. Teil I charakterisiert die Arten von Wissen, die von LMs erfasst werden. Zuerst werden drei neue LM-Architekturen beschreiben, deren neuronale Verbindungen von menschlichem Verhalten inspiriert wurden. Danach werden neuartige morphologie- und sentiment-basierte Paradigmen diskutiert, die menschliches Verhalten quantitativ erfassen. In Teil II werden mehrere Methoden entwickelt, die LMs durch Vergleich mit menschlichen Verhaltensma{\ss}en bewerten. Diskutiert werden die Eignung und m{\"o}gliche St{\"o}rfaktoren f{\"u}r Offline-Bewertungen und zwei Paradigmen von Online-Lesezeiten: Eye-Tracking und G-Maze. Ein neuartiger Datensatz der G-Maze-Antwortzeiten wird dazu verwendet, um rechnerische und sprachliche Beweise f{\"u}r die Divergenz zu liefern.},
pubstate = {published},
type = {phdthesis}
}

Project:   B4

Zhang, Miaoran; Wang, Mingyang; Alabi, Jesujoba; Klakow, Dietrich

AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness Inproceedings

Ojha, Atul Kr.; Doğruöz, A. Seza; Tayyar Madabushi, Harish; Da San Martino, Giovanni; Rosenthal, Sara; Rosá, Aiala (Ed.): Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, pp. 800-810, Mexico City, Mexico, 2024.

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

@inproceedings{zhang2024aadamsemeval2024task1,
title = {AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness},
author = {Miaoran Zhang and Mingyang Wang and Jesujoba Alabi and Dietrich Klakow},
editor = {Atul Kr. Ojha and A. Seza Do{\u{g}}ru{\"o}z and Harish Tayyar Madabushi and Giovanni Da San Martino and Sara Rosenthal and Aiala Ros{\'a}},
url = {https://aclanthology.org/2024.semeval-1.114},
doi = {10.18653/v1/2024.semeval-1.114},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)},
pages = {800--810},
publisher = {Association for Computational Linguistics},
address = {Mexico City, Mexico},
abstract = {This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Mosbach, Marius

Analyzing pre-trained and fine-tuned language models PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

The field of natural language processing (NLP) has recently undergone a paradigm shift. Since the introduction of transformer-based language models in 2018, the current generation of natural language processing models continues to demonstrate impressive capabilities on a variety of academic benchmarks and real-world applications. This paradigm shift is based on a simple but general pipeline which consists of pre-training neural language models on large quantities of text, followed by an adaptation step that fine-tunes the pre-trained model to perform a specific NLP task of interest. Despite the impressive progress on academic benchmarks and the widespread deployment of pre-trained and fine-tuned language models in industry, these models do not come without shortcomings which often have immediate consequences for the robustness and generalization of fine-tuned language models. Moreover, these shortcomings demonstrate that we still lack a fundamental understanding of how and why pre-trained and fine-tuned language models work as well as the individual steps of the pipeline that produce them. This thesis makes several contributions towards improving our understanding of pre-trained and fine-tuned language models by carrying out a detailed analysis of various parts of the modern NLP pipeline. Our contributions range from analyzing the linguistic knowledge of pre-trained language models and how it is affected by fine-tuning, to a rigorous analysis of the fine-tuning process itself and how the choice of adaptation technique affects the generalization of models. Overall, we provide new insights about previously unexplained phenomena and the capabilities of pre-trained and fine-tuned language models.


Im Bereich der Verarbeitung natürlicher Sprache (NLP) hat sich ein Paradigmenwechsel vollzogen. Seit der Einführung von transformer-basierten Sprachmodellen im Jahr 2018 zeigt die aktuelle Generation neuronaler Sprachverarbeitungsmodelle beeindruckende Fähigkeiten bei einer Vielzahl von akademischen Benchmarks und realen Anwendungen. Dieser Paradigmenwechsel basiert auf einer einfachen, aber allgemeinen Pipeline, die aus dem Vortrainieren von neuronalen Sprachmodellen auf großen Textmengen besteht, gefolgt von einem Anpassungsschritt, der das vortrainierte Modell modifiziert, um eine bestimmte NLP-Aufgabe durchzuführen. Trotz des beeindruckenden Fortschritts bei akademischen Benchmarks und des weit verbreiteten Einsatzes von vortrainierten und angepassten Sprachmodellen in der Industrie sind diese Modelle nicht ohne Mängel, und oft haben diese Mängel unmittelbare Auswirkungen auf die Robustheit und Generalisierung der Sprachmodelle. Darüber hinaus zeigen sie, dass uns einerseits noch immer ein grundlegendes Verständnis dafür fehlt, wie und warum vortrainierte und angepasste Sprachmodelle funktionieren, andererseits fehlt ein grundlegendes Verständnis der einzelnen Schritte der Pipeline. Diese Arbeit leistet mehrere Beiträge zur Verbesserung unseres Verständnisses von vortrainierten und angepassten Sprachmodellen, indem sie eine detaillierte Analyse verschiedener Teile der modernen NLP-Pipeline durchführt. Unsere Beiträge reichen von der Analyse des linguistischen Wissens von vortrainierten Sprachmodellen und wie dieses durch die Anpassung beeinflusst wird bis hin zu einer rigorosen Analyse des Anpassungsprozesses selbst und wie die Wahl der Anpassungstechnik die Generalisierung von Modellen beeinflusst, und liefern insgesamt neue Erkenntnisse über bisher unerklärte Phänomene und Fähigkeiten von vortrainierten und angepassten Sprachmodellen.

@phdthesis{Mosbach-2024-Thesis,
title = {Analyzing pre-trained and fine-tuned language models},
author = {Marius Mosbach},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/37254},
doi = {10.22028/D291-41531},
year = {2024},
date = {2024-02-19},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {The field of natural language processing (NLP) has recently undergone a paradigm shift. Since the introduction of transformer-based language models in 2018, the current generation of natural language processing models continues to demonstrate impressive capabilities on a variety of academic benchmarks and real-world applications. This paradigm shift is based on a simple but general pipeline which consists of pre-training neural language models on large quantities of text, followed by an adaptation step that fine-tunes the pre-trained model to perform a specific NLP task of interest. Despite the impressive progress on academic benchmarks and the widespread deployment of pre-trained and fine-tuned language models in industry, these models do not come without shortcomings which often have immediate consequences for the robustness and generalization of fine-tuned language models. Moreover, these shortcomings demonstrate that we still lack a fundamental understanding of how and why pre-trained and fine-tuned language models work as well as the individual steps of the pipeline that produce them. This thesis makes several contributions towards improving our understanding of pre-trained and fine-tuned language models by carrying out a detailed analysis of various parts of the modern NLP pipeline. Our contributions range from analyzing the linguistic knowledge of pre-trained language models and how it is affected by fine-tuning, to a rigorous analysis of the fine-tuning process itself and how the choice of adaptation technique affects the generalization of models. Overall, we provide new insights about previously unexplained phenomena and the capabilities of pre-trained and fine-tuned language models.


Im Bereich der Verarbeitung nat{\"u}rlicher Sprache (NLP) hat sich ein Paradigmenwechsel vollzogen. Seit der Einf{\"u}hrung von transformer-basierten Sprachmodellen im Jahr 2018 zeigt die aktuelle Generation neuronaler Sprachverarbeitungsmodelle beeindruckende F{\"a}higkeiten bei einer Vielzahl von akademischen Benchmarks und realen Anwendungen. Dieser Paradigmenwechsel basiert auf einer einfachen, aber allgemeinen Pipeline, die aus dem Vortrainieren von neuronalen Sprachmodellen auf gro{\ss}en Textmengen besteht, gefolgt von einem Anpassungsschritt, der das vortrainierte Modell modifiziert, um eine bestimmte NLP-Aufgabe durchzuf{\"u}hren. Trotz des beeindruckenden Fortschritts bei akademischen Benchmarks und des weit verbreiteten Einsatzes von vortrainierten und angepassten Sprachmodellen in der Industrie sind diese Modelle nicht ohne M{\"a}ngel, und oft haben diese M{\"a}ngel unmittelbare Auswirkungen auf die Robustheit und Generalisierung der Sprachmodelle. Dar{\"u}ber hinaus zeigen sie, dass uns einerseits noch immer ein grundlegendes Verst{\"a}ndnis daf{\"u}r fehlt, wie und warum vortrainierte und angepasste Sprachmodelle funktionieren, andererseits fehlt ein grundlegendes Verst{\"a}ndnis der einzelnen Schritte der Pipeline. Diese Arbeit leistet mehrere Beitr{\"a}ge zur Verbesserung unseres Verst{\"a}ndnisses von vortrainierten und angepassten Sprachmodellen, indem sie eine detaillierte Analyse verschiedener Teile der modernen NLP-Pipeline durchf{\"u}hrt. Unsere Beitr{\"a}ge reichen von der Analyse des linguistischen Wissens von vortrainierten Sprachmodellen und wie dieses durch die Anpassung beeinflusst wird bis hin zu einer rigorosen Analyse des Anpassungsprozesses selbst und wie die Wahl der Anpassungstechnik die Generalisierung von Modellen beeinflusst, und liefern insgesamt neue Erkenntnisse {\"u}ber bisher unerkl{\"a}rte Ph{\"a}nomene und F{\"a}higkeiten von vortrainierten und angepassten Sprachmodellen.},
pubstate = {published},
type = {phdthesis}
}

Project:   B4

Steuer, Julius; Mosbach, Marius; Klakow, Dietrich

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures Inproceedings

Warstadt, Alex; Mueller, Aaron; Choshen, Leshem; Wilcox, Ethan; Zhuang, Chengxu; Ciro, Juan; Mosquera, Rafael; Paranjabe, Bhargavi; Williams, Adina; Linzen, Tal; Cotterell, Ryan (Ed.): Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Association for Computational Linguistics, pp. 142-157, Singapore, 2023.

Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus, evaluating on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks. In contrast, a negative correlation was found between LM size and reading time fit of linear mixed-effects models using LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.

@inproceedings{steuer-etal-2023-large,
title = {Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures},
author = {Julius Steuer and Marius Mosbach and Dietrich Klakow},
editor = {Alex Warstadt and Aaron Mueller and Leshem Choshen and Ethan Wilcox and Chengxu Zhuang and Juan Ciro and Rafael Mosquera and Bhargavi Paranjabe and Adina Williams and Tal Linzen and Ryan Cotterell},
url = {https://aclanthology.org/2023.conll-babylm.12/},
doi = {10.18653/v1/2023.conll-babylm.12},
year = {2023},
date = {2023},
booktitle = {Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning},
pages = {142--157},
publisher = {Association for Computational Linguistics},
address = {Singapore},
abstract = {Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus, evaluating on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks. In contrast, a negative correlation was found between LM size and reading time fit of linear mixed-effects models using LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Gautam, Vagrant; Zhang, Miaoran; Klakow, Dietrich

A Lightweight Method to Generate Unanswerable Questions in English Inproceedings

Bouamor, Houda; Pino, Juan; Bali, Kalika (Ed.): Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, pp. 7349-7360, Singapore, 2023.

If a question cannot be answered with the available information, robust systems for question answering (QA) should know *not* to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.

@inproceedings{gautam-etal-2023-lightweight,
title = {A Lightweight Method to Generate Unanswerable Questions in English},
author = {Vagrant Gautam and Miaoran Zhang and Dietrich Klakow},
editor = {Houda Bouamor and Juan Pino and Kalika Bali},
url = {https://aclanthology.org/2023.findings-emnlp.491},
doi = {10.18653/v1/2023.findings-emnlp.491},
year = {2023},
date = {2023},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023},
pages = {7349--7360},
publisher = {Association for Computational Linguistics},
address = {Singapore},
abstract = {If a question cannot be answered with the available information, robust systems for question answering (QA) should know *not* to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Zhu, Dawei; Shen, Xiaoyu; Mosbach, Marius; Stephan, Andreas; Klakow, Dietrich

Weaker Than You Think: A Critical Look at Weakly Supervised Learning Inproceedings

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 14229-14253, Toronto, Canada, 2023.

Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this paper, we revisit the setup of these approaches and find that the benefits brought by these approaches are significantly overestimated. Specifically, we find that the success of existing weakly supervised learning approaches heavily relies on the availability of clean validation samples which, as we show, can be leveraged much more efficiently by simply training on them. After using these clean labels in training, the advantages of using these sophisticated approaches are mostly wiped out. This remains true even when reducing the size of the available clean data to just five samples per class, making these approaches impractical. To understand the true value of weakly supervised learning, we thoroughly analyze diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work. Based on our findings, we provide recommendations for future research.

@inproceedings{zhu-etal-2023-weaker,
title = {Weaker Than You Think: A Critical Look at Weakly Supervised Learning},
author = {Dawei Zhu and Xiaoyu Shen and Marius Mosbach and Andreas Stephan and Dietrich Klakow},
url = {https://aclanthology.org/2023.acl-long.796},
doi = {10.18653/v1/2023.acl-long.796},
year = {2023},
date = {2023-09-21},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages = {14229--14253},
publisher = {Association for Computational Linguistics},
address = {Toronto, Canada},
abstract = {Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this paper, we revisit the setup of these approaches and find that the benefits brought by these approaches are significantly overestimated. Specifically, we find that the success of existing weakly supervised learning approaches heavily relies on the availability of clean validation samples which, as we show, can be leveraged much more efficiently by simply training on them. After using these clean labels in training, the advantages of using these sophisticated approaches are mostly wiped out. This remains true even when reducing the size of the available clean data to just five samples per class, making these approaches impractical. To understand the true value of weakly supervised learning, we thoroughly analyze diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work. Based on our findings, we provide recommendations for future research.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Mosbach, Marius; Pimentel, Tiago; Ravfogel, Shauli; Klakow, Dietrich; Elazar, Yanai

Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation Inproceedings

Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, pp. 12284-12314, Toronto, Canada, 2023.

Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations. Unfortunately, previous comparisons of the two approaches were done using models of different sizes. This raises the question of whether the observed weaker out-of-domain generalization of fine-tuned models is an inherent property of fine-tuning or a limitation of the experimental setup. In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B. Our results show that fine-tuned language models can in fact generalize well out-of-domain. We find that both approaches generalize similarly; they exhibit large variation and depend on properties such as model size and the number of examples, highlighting that robust task adaptation remains a challenge.

@inproceedings{mosbach-etal-2023-shot,
title = {Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation},
author = {Marius Mosbach and Tiago Pimentel and Shauli Ravfogel and Dietrich Klakow and Yanai Elazar},
url = {https://aclanthology.org/2023.findings-acl.779},
doi = {10.18653/v1/2023.findings-acl.779},
year = {2023},
date = {2023},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
pages = {12284--12314},
publisher = {Association for Computational Linguistics},
address = {Toronto, Canada},
abstract = {Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations. Unfortunately, previous comparisons of the two approaches were done using models of different sizes. This raises the question of whether the observed weaker out-of-domain generalization of fine-tuned models is an inherent property of fine-tuning or a limitation of the experimental setup. In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B. Our results show that fine-tuned language models can in fact generalize well out-of-domain. We find that both approaches generalize similarly; they exhibit large variation and depend on properties such as model size and the number of examples, highlighting that robust task adaptation remains a challenge.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Steuer, Julius; Abdullah, Badr M.; List, Johann-Mattis; Klakow, Dietrich

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists Inproceedings

Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 96-109, Dubrovnik, Croatia, 2023.

We present a cross-linguistic study of vowel harmony that aims to quantify this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis of vowel harmony. In contrast, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate that neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

@inproceedings{steuer-etal-2023-information,
title = {Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists},
author = {Julius Steuer and Badr M. Abdullah and Johann-Mattis List and Dietrich Klakow},
url = {https://aclanthology.org/2023.sigtyp-1.10},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
pages = {96-109},
publisher = {Association for Computational Linguistics},
address = {Dubrovnik, Croatia},
abstract = {We present a cross-linguistic study of vowel harmony that aims to quantify this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis of vowel harmony. In contrast, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate that neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B4 C4

Hedderich, Michael

Weak supervision and label noise handling for Natural language processing in low-resource scenarios PhD Thesis

Saarland University, Saarbruecken, Germany, 2022.

The lack of large amounts of labeled data is a significant factor blocking many low-resource languages and domains from catching up with recent advancements in natural language processing. To reduce this dependency on labeled instances, weak supervision (semi-)automatically annotates unlabeled data. These labels can be obtained more quickly and cheaply than manual, gold-standard annotations. They also, however, contain more errors. Handling these noisy labels is often required to leverage the weakly supervised data successfully. In this dissertation, we study the whole weak supervision pipeline with a focus on the task of named entity recognition. We develop a tool for automatic annotation, and we propose an approach to model label noise when a small amount of clean data is available. We study the factors that influence the noise model’s quality from a theoretic perspective, and we validate this approach empirically on several different tasks and languages. An important aspect is the aim for a realistic evaluation. We perform our analysis, among others, on several African low-resource languages. We show the performance benefits that can be achieved using weak supervision and label noise modeling. But we also highlight open issues that the field still has to overcome. For the low-resource settings, we expand the analysis to few-shot learning. For classification errors, we present a novel approach to obtain interpretable insights of where classifiers fail.


Der Mangel an annotierten Daten ist ein wesentlicher Faktor, der viele Sprachen und Domänen mit geringen Ressourcen daran hindert, mit den jüngsten Fortschritten in der digitalen Textverarbeitung Schritt zu halten. Um diese Abhängigkeit von gelabelten Trainingsdaten zu verringern, werden bei Weak Supervision nicht gelabelte Daten (halb-)automatisch annotiert. Diese Annotationen sind schneller und günstiger zu erhalten. Sie enthalten jedoch auch mehr Fehler. Oft ist eine besondere Behandlung dieser Noisy Labels notwendig, um die Daten erfolgreich nutzen zu können. In dieser Dissertation untersuchen wir die gesamte Weak Supervision Pipeline mit einem Schwerpunkt auf den Einsatz für die Erkennung von Entitäten. Wir entwickeln ein Tool zur automatischen Annotation und präsentieren einen neuen Ansatz zur Modellierung von Noisy Labels. Wir untersuchen die Faktoren, die die Qualität dieses Modells aus theoretischer Sicht beeinflussen, und wir validieren den Ansatz empirisch für verschiedene Aufgaben und Sprachen. Ein wichtiger Aspekt dieser Arbeit ist das Ziel einer realistischen Analyse. Die Untersuchung führen wir unter anderem an mehreren afrikanischen Sprachen durch und zeigen die Leistungsvorteile, die durch Weak Supervision und die Modellierung von Label Noise erreicht werden können. Auch erweitern wir die Analyse auf das Lernen mit wenigen Beispielen. In Bezug auf Klassifizierungsfehler, stellen wir zudem einen neuen Ansatz vor, um interpretierbare Erkenntnisse zu gewinnen.

@phdthesis{Hedderich_Diss_2022,
title = {Weak supervision and label noise handling for Natural language processing in low-resource scenarios},
author = {Michael Hedderich},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/35026},
doi = {https://doi.org/10.22028/D291-38691},
year = {2022},
date = {2022},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {The lack of large amounts of labeled data is a significant factor blocking many low-resource languages and domains from catching up with recent advancements in natural language processing. To reduce this dependency on labeled instances, weak supervision (semi-)automatically annotates unlabeled data. These labels can be obtained more quickly and cheaply than manual, gold-standard annotations. They also, however, contain more errors. Handling these noisy labels is often required to leverage the weakly supervised data successfully. In this dissertation, we study the whole weak supervision pipeline with a focus on the task of named entity recognition. We develop a tool for automatic annotation, and we propose an approach to model label noise when a small amount of clean data is available. We study the factors that influence the noise model's quality from a theoretic perspective, and we validate this approach empirically on several different tasks and languages. An important aspect is the aim for a realistic evaluation. We perform our analysis, among others, on several African low-resource languages. We show the performance benefits that can be achieved using weak supervision and label noise modeling. But we also highlight open issues that the field still has to overcome. For the low-resource settings, we expand the analysis to few-shot learning. For classification errors, we present a novel approach to obtain interpretable insights of where classifiers fail.


Der Mangel an annotierten Daten ist ein wesentlicher Faktor, der viele Sprachen und Dom{\"a}nen mit geringen Ressourcen daran hindert, mit den j{\"u}ngsten Fortschritten in der digitalen Textverarbeitung Schritt zu halten. Um diese Abh{\"a}ngigkeit von gelabelten Trainingsdaten zu verringern, werden bei Weak Supervision nicht gelabelte Daten (halb-)automatisch annotiert. Diese Annotationen sind schneller und g{\"u}nstiger zu erhalten. Sie enthalten jedoch auch mehr Fehler. Oft ist eine besondere Behandlung dieser Noisy Labels notwendig, um die Daten erfolgreich nutzen zu k{\"o}nnen. In dieser Dissertation untersuchen wir die gesamte Weak Supervision Pipeline mit einem Schwerpunkt auf den Einsatz f{\"u}r die Erkennung von Entit{\"a}ten. Wir entwickeln ein Tool zur automatischen Annotation und pr{\"a}sentieren einen neuen Ansatz zur Modellierung von Noisy Labels. Wir untersuchen die Faktoren, die die Qualit{\"a}t dieses Modells aus theoretischer Sicht beeinflussen, und wir validieren den Ansatz empirisch f{\"u}r verschiedene Aufgaben und Sprachen. Ein wichtiger Aspekt dieser Arbeit ist das Ziel einer realistischen Analyse. Die Untersuchung f{\"u}hren wir unter anderem an mehreren afrikanischen Sprachen durch und zeigen die Leistungsvorteile, die durch Weak Supervision und die Modellierung von Label Noise erreicht werden k{\"o}nnen. Auch erweitern wir die Analyse auf das Lernen mit wenigen Beispielen. In Bezug auf Klassifizierungsfehler, stellen wir zudem einen neuen Ansatz vor, um interpretierbare Erkenntnisse zu gewinnen.},
pubstate = {published},
type = {phdthesis}
}

Project:   B4

Alabi, Jesujoba; Adelani, David; Mosbach, Marius; Klakow, Dietrich

Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning Inproceedings

Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, pp. 4336-4349, Gyeongju, Republic of Korea, 2022.

Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT): fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.

@inproceedings{alabi-etal-2022-adapting,
title = {Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning},
author = {Jesujoba Alabi and David Adelani and Marius Mosbach and Dietrich Klakow},
url = {https://aclanthology.org/2022.coling-1.382},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 29th International Conference on Computational Linguistics},
pages = {4336-4349},
publisher = {International Committee on Computational Linguistics},
address = {Gyeongju, Republic of Korea},
abstract = {Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT): fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50{\%}. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Zhang, Miaoran; Mosbach, Marius; Adelani, David; Hedderich, Michael; Klakow, Dietrich

MCSE: Multimodal Contrastive Learning of Sentence Embeddings Inproceedings

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 5959-5969, Seattle, United States, 2022.

Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman's correlation by 1.7%. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.

@inproceedings{zhang-etal-2022-mcse,
title = {MCSE: Multimodal Contrastive Learning of Sentence Embeddings},
author = {Miaoran Zhang and Marius Mosbach and David Adelani and Michael Hedderich and Dietrich Klakow},
url = {https://aclanthology.org/2022.naacl-main.436},
doi = {https://doi.org/10.18653/v1/2022.naacl-main.436},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages = {5959-5969},
publisher = {Association for Computational Linguistics},
address = {Seattle, United States},
abstract = {Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman{'}s correlation by 1.7{\%}. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Zouhar, Vilém; Mosbach, Marius; Zhang, Miaoran; Klakow, Dietrich

Knowledge Base Index Compression via Dimensionality and Precision Reduction Inproceedings

Spa-NLP workshop at ACL 2022, 22nd-27th May 2022 Dublin, Ireland, 2022.

Recently, neural-network-based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB) which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing, and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100× compression with 75%, and (2) 24× compression with 92% original retrieval performance.

@inproceedings{Zouhar_2022_Base,
title = {Knowledge Base Index Compression via Dimensionality and Precision Reduction},
author = {Vil{\'e}m Zouhar and Marius Mosbach and Miaoran Zhang and Dietrich Klakow},
url = {https://arxiv.org/abs/2204.02906},
year = {2022},
date = {2022},
publisher = {Spa-NLP workshop at ACL 2022},
address = {22nd-27th May 2022 Dublin, Ireland},
abstract = {Recently, neural-network-based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB) which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing, and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100× compression with 75{\%}, and (2) 24× compression with 92{\%} original retrieval performance.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Zhu, Dawei; Mogadala, Aditya; Klakow, Dietrich

Image manipulation with natural language using Two-sided Attentive Conditional Generative Adversarial Network Journal Article

Neural Networks, 136, pp. 207-217, 2021, ISSN 0893-6080.

Altering the content of an image with photo editing tools is a tedious task for an inexperienced user, especially when modifying the visual attributes of a specific object in an image without affecting other constituents such as the background. To simplify the process of image manipulation and to provide more control to users, it is better to utilize a simpler interface like natural language. Therefore, in this paper, we address the challenge of manipulating images using natural language descriptions. We propose the Two-sidEd Attentive conditional Generative Adversarial Network (TEA-cGAN) to generate semantically manipulated images while keeping other content, such as the background, intact. TEA-cGAN uses fine-grained attention in both the generator and the discriminator of a Generative Adversarial Network (GAN) based framework at different scales. Experimental results show that TEA-cGAN, which generates 128×128 and 256×256 resolution images, outperforms existing methods on the CUB and Oxford-102 datasets both quantitatively and qualitatively.

@article{zhumogadala:2020,
title = {Image manipulation with natural language using Two-sided Attentive Conditional Generative Adversarial Network},
author = {Dawei Zhu and Aditya Mogadala and Dietrich Klakow},
url = {https://www.sciencedirect.com/science/article/pii/S0893608020303257},
doi = {https://doi.org/10.1016/j.neunet.2020.09.002},
year = {2021},
date = {2021},
journal = {Neural Networks},
pages = {207-217},
volume = {136},
abstract = {Altering the content of an image with photo editing tools is a tedious task for an inexperienced user, especially when modifying the visual attributes of a specific object in an image without affecting other constituents such as the background. To simplify the process of image manipulation and to provide more control to users, it is better to utilize a simpler interface like natural language. Therefore, in this paper, we address the challenge of manipulating images using natural language descriptions. We propose the Two-sidEd Attentive conditional Generative Adversarial Network (TEA-cGAN) to generate semantically manipulated images while keeping other content, such as the background, intact. TEA-cGAN uses fine-grained attention in both the generator and the discriminator of a Generative Adversarial Network (GAN) based framework at different scales. Experimental results show that TEA-cGAN, which generates 128x128 and 256x256 resolution images, outperforms existing methods on the CUB and Oxford-102 datasets both quantitatively and qualitatively.},
pubstate = {published},
type = {article}
}

Project:   B4

Mogadala, Aditya; Kalimuthu, Marimuthu; Klakow, Dietrich

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods Journal Article

Journal of Artificial Intelligence Research, 71, Access Foundation, pp. 1183-1317, 2021.

The interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). The largest growth in these fields has been made possible by deep learning, a sub-area of machine learning, which uses the principles of artificial neural networks. This has created significant interest in the integration of vision and language. The tasks are designed such that they perfectly embrace the ideas of deep learning. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and by comparing the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys, which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research, anticipating that this survey will bring innovative thoughts and ideas to address the existing challenges and build new applications.

@article{mogadala2021trends,
title = {Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods},
author = {Aditya Mogadala and Marimuthu Kalimuthu and Dietrich Klakow},
url = {https://arxiv.org/abs/1907.09358},
doi = {https://doi.org/10.1613/jair.1.11688},
year = {2021},
date = {2021},
journal = {Journal of Artificial Intelligence Research},
pages = {1183-1317},
publisher = {Access Foundation},
volume = {71},
abstract = {The interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). The largest growth in these fields has been made possible by deep learning, a sub-area of machine learning, which uses the principles of artificial neural networks. This has created significant interest in the integration of vision and language. The tasks are designed such that they perfectly embrace the ideas of deep learning. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, and evaluation measures, and by comparing the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys, which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research, anticipating that this survey will bring innovative thoughts and ideas to address the existing challenges and build new applications.},
pubstate = {published},
type = {article}
}

Project:   B4

Mosbach, Marius; Stenger, Irina; Avgustinova, Tania; Möbius, Bernd; Klakow, Dietrich

incom.py 2.0 - Calculating Linguistic Distances and Asymmetries in Auditory Perception of Closely Related Languages Inproceedings

Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), INCOMA Ltd., pp. 968-977, Held Online, 2021.

We present an extended version of a tool developed for calculating linguistic distances and asymmetries in auditory perception of closely related languages. Along with evaluating the metrics available in the initial version of the tool, we introduce word adaptation entropy as an additional metric of linguistic asymmetry. Potential predictors of speech intelligibility are validated with human performance in spoken cognate recognition experiments for Bulgarian and Russian. Special attention is paid to the possibly different contributions of vowels and consonants in oral intercomprehension. Using incom.py 2.0, it is possible to calculate, visualize, and validate three measurement methods of linguistic distances and asymmetries, as well as to carry out regression analyses of speech intelligibility between related languages.

@inproceedings{mosbach-etal-2021-incom,
title = {incom.py 2.0 - Calculating Linguistic Distances and Asymmetries in Auditory Perception of Closely Related Languages},
author = {Marius Mosbach and Irina Stenger and Tania Avgustinova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://aclanthology.org/2021.ranlp-1.110/},
year = {2021},
date = {2021-09-01},
booktitle = {Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)},
pages = {968-977},
publisher = {INCOMA Ltd.},
address = {Held Online},
abstract = {We present an extended version of a tool developed for calculating linguistic distances and asymmetries in auditory perception of closely related languages. Along with evaluating the metrics available in the initial version of the tool, we introduce word adaptation entropy as an additional metric of linguistic asymmetry. Potential predictors of speech intelligibility are validated with human performance in spoken cognate recognition experiments for Bulgarian and Russian. Special attention is paid to the possibly different contributions of vowels and consonants in oral intercomprehension. Using incom.py 2.0, it is possible to calculate, visualize, and validate three measurement methods of linguistic distances and asymmetries, as well as to carry out regression analyses of speech intelligibility between related languages.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B4 C4

Mosbach, Marius; Andriushchenko, Maksym; Klakow, Dietrich

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines Inproceedings

International Conference on Learning Representations, 2021.

Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches.

@inproceedings{mosbach2021on,
title = {On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines},
author = {Marius Mosbach and Maksym Andriushchenko and Dietrich Klakow},
url = {https://arxiv.org/abs/2006.04884},
year = {2021},
date = {2021},
booktitle = {International Conference on Learning Representations},
abstract = {Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Abdullah, Badr M.; Mosbach, Marius; Zaitova, Iuliia; Möbius, Bernd; Klakow, Dietrich

Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study Inproceedings

Proceedings of Interspeech 2021, 2021.

Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.

@inproceedings{Abdullah2021DoAW,
title = {Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study},
author = {Badr M. Abdullah and Marius Mosbach and Iuliia Zaitova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://arxiv.org/abs/2106.08686},
year = {2021},
date = {2021},
booktitle = {Proceedings of Interspeech 2021},
abstract = {Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.},
pubstate = {published},
type = {inproceedings}
}

Projects:   C4 B4

Jágrová, Klára; Hedderich, Michael; Mosbach, Marius; Avgustinova, Tania; Klakow, Dietrich

On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers Journal Article

Frontiers in Psychology, 12, pp. 2296, 2021, ISSN 1664-1078.

This contribution seeks to provide a rational probabilistic explanation for the intelligibility of words in a genetically related language that is unknown to the reader, a phenomenon referred to as intercomprehension. In this research domain, linguistic distance, among other factors, was proved to correlate well with the mutual intelligibility of individual words. However, the role of context for the intelligibility of target words in sentences was subject to very few studies. To address this, we analyze data from web-based experiments in which Czech (CS) respondents were asked to translate highly predictable target words at the final position of Polish sentences. We compare correlations of target word intelligibility with data from 3-gram language models (LMs) to their correlations with data obtained from context-aware LMs. More specifically, we evaluate two context-aware LM architectures: Long Short-Term Memory (LSTMs) that can, theoretically, take infinitely long-distance dependencies into account and Transformer-based LMs which can access the whole input sequence at the same time. We investigate how their use of context affects surprisal and its correlation with intelligibility.

@article{10.3389/fpsyg.2021.662277,
title = {On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers},
author = {Kl{\'a}ra J{\'a}grov{\'a} and Michael Hedderich and Marius Mosbach and Tania Avgustinova and Dietrich Klakow},
url = {https://www.frontiersin.org/articles/10.3389/fpsyg.2021.662277/full},
doi = {10.3389/fpsyg.2021.662277},
year = {2021},
date = {2021},
journal = {Frontiers in Psychology},
pages = {2296},
volume = {12},
abstract = {This contribution seeks to provide a rational probabilistic explanation for the intelligibility of words in a genetically related language that is unknown to the reader, a phenomenon referred to as intercomprehension. In this research domain, linguistic distance, among other factors, was proved to correlate well with the mutual intelligibility of individual words. However, the role of context for the intelligibility of target words in sentences was subject to very few studies. To address this, we analyze data from web-based experiments in which Czech (CS) respondents were asked to translate highly predictable target words at the final position of Polish sentences. We compare correlations of target word intelligibility with data from 3-gram language models (LMs) to their correlations with data obtained from context-aware LMs. More specifically, we evaluate two context-aware LM architectures: Long Short-Term Memory (LSTMs) that can, theoretically, take infinitely long-distance dependencies into account and Transformer-based LMs which can access the whole input sequence at the same time. We investigate how their use of context affects surprisal and its correlation with intelligibility.},
pubstate = {published},
type = {article}
}

Projects:   B4 C4

Zouhar, Vilém; Mosbach, Marius; Biswas, Debanjali; Klakow, Dietrich

Artefact Retrieval: Overview of NLP Models with Knowledge Base Access Inproceedings

Workshop on Commonsense Reasoning and Knowledge Bases, 2021.

Many NLP models gain performance by having access to a knowledge base. A lot of research has been devoted to devising and improving the way the knowledge base is accessed and incorporated into the model, resulting in a number of mechanisms and pipelines. Despite the diversity of proposed mechanisms, there are patterns in the designs of such systems. In this paper, we systematically describe the typology of *artefacts* (items retrieved from a knowledge base), retrieval mechanisms and the way these artefacts are *fused* into the model. This further allows us to uncover combinations of design decisions that had not yet been tried. Most of the focus is given to language models, though we also show how question answering, fact-checking and knowledgeable dialogue models fit into this system as well. Having an abstract model which can describe the architecture of specific models also helps with transferring these architectures between multiple NLP tasks.

@inproceedings{zouhar2021artefact,
title = {Artefact Retrieval: Overview of NLP Models with Knowledge Base Access},
author = {Vil{\'e}m Zouhar and Marius Mosbach and Debanjali Biswas and Dietrich Klakow},
url = {https://arxiv.org/abs/2201.09651},
year = {2021},
date = {2021},
booktitle = {Workshop on Commonsense Reasoning and Knowledge Bases},
abstract = {Many NLP models gain performance by having access to a knowledge base. A lot of research has been devoted to devising and improving the way the knowledge base is accessed and incorporated into the model, resulting in a number of mechanisms and pipelines. Despite the diversity of proposed mechanisms, there are patterns in the designs of such systems. In this paper, we systematically describe the typology of *artefacts* (items retrieved from a knowledge base), retrieval mechanisms and the way these artefacts are *fused* into the model. This further allows us to uncover combinations of design decisions that had not yet been tried. Most of the focus is given to language models, though we also show how question answering, fact-checking and knowledgeable dialogue models fit into this system as well. Having an abstract model which can describe the architecture of specific models also helps with transferring these architectures between multiple NLP tasks.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Kalimuthu, Marimuthu; Mogadala, Aditya; Mosbach, Marius; Klakow, Dietrich

Fusion Models for Improved Image Captioning Inproceedings

Pattern Recognition. ICPR International Workshops and Challenges, pp. 381-395, Cham, 2020.

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to making mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis on the emended captions and identify error categories based on the type of corrections.

@inproceedings{Kalimuthu2021fusion,
title = {Fusion Models for Improved Image Captioning},
author = {Marimuthu Kalimuthu and Aditya Mogadala and Marius Mosbach and Dietrich Klakow},
url = {https://arxiv.org/abs/2010.15251},
doi = {10.1007/978-3-030-68780-9_32},
year = {2020},
date = {2020},
booktitle = {Pattern Recognition. ICPR International Workshops and Challenges},
pages = {381--395},
address = {Cham},
abstract = {Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to making mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis on the emended captions and identify error categories based on the type of corrections.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4
