Publications

Alabi, Jesujoba; Azime, Israel Abebe; Zhang, Miaoran; España-Bonet, Cristina; Bawden, Rachel; Zhu, Dawei; Adelani, David; Odoje, Clement Oyeleke; Akinade, Idris; Maab, Iffat; David, Davis; Muhammad, Shamsuddeen Hassan; Putini, Neo ; Ademuyiwa, David O.; Caines, Andrew; Klakow, Dietrich

AFRIDOC-MT: Document-level MT Corpus for African Languages Inproceedings

Christodoulopoulos, Christos; Chakraborty, Tanmoy; Rose, Carolyn; Peng, Violet (Ed.): Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 27770-27806, Suzhou, China, 2025, ISBN 979-8-89176-332-6.

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating the ability of neural machine translation (NMT) models and large language models (LLMs) to translate between English and these languages, at both the sentence and pseudo-document levels, the outputs being realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieves the best average performance among the standard NMT models, while GPT-4o outperforms general-purpose LLMs. Fine-tuning selected models leads to substantial performance gains, but models trained on sentences struggle to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, over-generation, repetition of words and phrases, and off-target translations, specifically for translation into African languages.

@inproceedings{alabi-etal-2025-afridoc,
title = {AFRIDOC-MT: Document-level MT Corpus for African Languages},
author = {Jesujoba Alabi and Israel Abebe Azime and Miaoran Zhang and Cristina Espa{\~n}a-Bonet and Rachel Bawden and Dawei Zhu and David Adelani and Clement Oyeleke Odoje and Idris Akinade and Iffat Maab and Davis David and Shamsuddeen Hassan Muhammad and Neo Putini and David O. Ademuyiwa and Andrew Caines and Dietrich Klakow},
editor = {Christos Christodoulopoulos and Tanmoy Chakraborty and Carolyn Rose and Violet Peng},
url = {https://aclanthology.org/2025.emnlp-main.1413/},
doi = {https://doi.org/10.18653/v1/2025.emnlp-main.1413},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
isbn = {979-8-89176-332-6},
pages = {27770-27806},
publisher = {Association for Computational Linguistics},
address = {Suzhou, China},
abstract = {This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùb{\'a}, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating the ability of neural machine translation (NMT) models and large language models (LLMs) to translate between English and these languages, at both the sentence and pseudo-document levels, the outputs being realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieves the best average performance among the standard NMT models, while GPT-4o outperforms general-purpose LLMs. Fine-tuning selected models leads to substantial performance gains, but models trained on sentences struggle to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, over-generation, repetition of words and phrases, and off-target translations, specifically for translation into African languages.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Projects:   B4 B6

Alabi, Jesujoba; Hedderich, Michael; Adelani, David; Klakow, Dietrich

Charting the landscape of african nlp: Mapping progress and shaping the road ahead Inproceedings

Christodoulopoulos, Christos; Chakraborty, Tanmoy; Rose, Carolyn; Peng, Violet (Ed.): Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 27807-27841, Suzhou, China, 2025, ISBN 979-8-89176-332-6.

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors{—}including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

@inproceedings{alabi-etal-2025-charting,
title = {Charting the landscape of african nlp: Mapping progress and shaping the road ahead},
author = {Jesujoba Alabi and Michael Hedderich and David Adelani and Dietrich Klakow},
editor = {Christos Christodoulopoulos and Tanmoy Chakraborty and Carolyn Rose and Violet Peng},
url = {https://aclanthology.org/2025.emnlp-main.1414/},
doi = {https://doi.org/10.18653/v1/2025.emnlp-main.1414},
year = {2025},
date = {2025},
booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
isbn = {979-8-89176-332-6},
pages = {27807-27841},
publisher = {Association for Computational Linguistics},
address = {Suzhou, China},
abstract = {With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors{---}including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Steuer, Julius; Krielke, Marie-Pauline; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Mosbach, Marius; Klakow, Dietrich

Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 12-23, Torino, Italia, 2024.

This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, to create diachronically comparable language models there are several challenges with historical data, notably an exponential increase in no. of texts, tokens per text and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continuing pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios.

@inproceedings{steuer-etal-2024-modeling ,
title = {Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal},
author = {Julius Steuer and Marie-Pauline Krielke and Stefan Fischer and Stefania Degaetano-Ortlieb and Marius Mosbach and Dietrich Klakow},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.2/},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {12-23},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {This study presents an analysis of diachronic linguistic changes in English scientific writing, utilizing surprisal from transformer-based language models. Unlike traditional n-gram models, transformer-based models are potentially better at capturing nuanced linguistic changes such as long-range dependencies by considering variable context sizes. However, to create diachronically comparable language models there are several challenges with historical data, notably an exponential increase in no. of texts, tokens per text and vocabulary size over time. We address these by using a shared vocabulary and employing a robust training strategy that includes initial uniform sampling from the corpus and continuing pre-training on specific temporal segments. Our empirical analysis highlights the predictive power of surprisal from transformer-based models, particularly in analyzing complex linguistic structures like relative clauses. The models’ broader contextual awareness and the inclusion of dependency length annotations contribute to a more intricate understanding of communicative efficiency. While our focus is on scientific English, our approach can be applied to other low-resource scenarios.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Projects:   B1 B4

Zhu, Dawei; Chen, Pinzhen; Zhang, Miaoran; Haddow, Barry; Shen, Xiaoyu; Klakow, Dietrich

Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice? Inproceedings

Al-Onaizan, Yaser; Bansal, Mohit; Chen, Yun-Nung (Ed.): Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 388-409, Miami, Florida, USA, 2024.

Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of these factors. We find that LLMs display strong translation capability after being fine-tuned on as few as 32 parallel sentences and that fine-tuning on a single translation direction enables translation in multiple directions. However, the choice of direction is critical: fine-tuning LLMs with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages. Problems also arise when noisy synthetic data is placed on the target side, especially when the target language is well-represented in LLM pre-training. Yet interestingly, synthesized data in an under-represented language has a less pronounced effect. Our findings suggest that when adapting LLMs to translation, the requirement on data quantity can be eased but careful considerations are still crucial to prevent an LLM from exploiting unintended data biases.

@inproceedings{zhu-etal-2024-fine,
title = {Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?},
author = {Dawei Zhu and Pinzhen Chen and Miaoran Zhang and Barry Haddow and Xiaoyu Shen and Dietrich Klakow},
editor = {Yaser Al-Onaizan and Mohit Bansal and Yun-Nung Chen},
url = {https://aclanthology.org/2024.emnlp-main.24/},
doi = {https://doi.org/10.18653/v1/2024.emnlp-main.24},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages = {388-409},
publisher = {Association for Computational Linguistics},
address = {Miami, Florida, USA},
abstract = {Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of these factors. We find that LLMs display strong translation capability after being fine-tuned on as few as 32 parallel sentences and that fine-tuning on a single translation direction enables translation in multiple directions. However, the choice of direction is critical: fine-tuning LLMs with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages. Problems also arise when noisy synthetic data is placed on the target side, especially when the target language is well-represented in LLM pre-training. Yet interestingly, synthesized data in an under-represented language has a less pronounced effect. Our findings suggest that when adapting LLMs to translation, the requirement on data quantity can be eased but careful considerations are still crucial to prevent an LLM from exploiting unintended data biases.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Greenberg, Clayton

Evaluating humanness in language models PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

Advances with language models, systems that predict upcoming words in context, have enabled an era in which people sometimes cannot distinguish between human-written and artificially created text. Perplexity, the simplest and most popular way to evaluate the quality of a language model, rewards any pattern captured by the system as long as it robustly constrains the upcoming possibilities. By capturing patterns that humans do not use, optimizing a language model for minimal perplexity could trigger a divergence between the most probable text and the most human-like text. In this thesis, I argue that this divergence has happened for state-of-the-art language models. Part I characterizes the kinds of knowledge captured by language models. First, I present three novel language model architectures whose neural connections were inspired by human behavior. Then, I discuss novel morphology- and sentiment-based paradigms that capture human knowledge quantitatively. Part II establishes several methods for evaluating language models by comparison against human behavior measures. I consider the suitability and potential confounds for offline ratings and two paradigms of online reading times: eye-tracking and G-Maze. Then, I use a novel dataset of G-Maze response times to show computational and linguistic evidence of the divergence.


Fortschritte bei Sprachmodellen (LMs) – Systeme, die aus dem Kontext heraus nachfolgende Worte vorhersagen – haben dazu geführt, dass Menschen manchmal nicht mehr zwischen von Menschen geschriebenem und künstlich erzeugtem Text unterscheiden können. Perplexität (PPL), die einfachste und beliebteste Methode zur Bewertung der Qualität eines LM, belohnt jedes vom System erfasste Muster, solange es die kommenden Möglichkeiten stark einschränkt. Durch die Erfassung von Mustern, die Menschen nicht verwenden, könnte die Optimierung eines LM hinsichtlich minimaler PPL zu einer Divergenz zwischen dem wahrscheinlichsten Text und dem menschenähnlichsten Text führen. In dieser Arbeit wird argumentiert, dass diese Divergenz bei modernen LMs aufgetreten ist. Teil I charakterisiert die Arten von Wissen, die von LMs erfasst werden. Zuerst werden drei neue LM-Architekturen beschreiben, deren neuronale Verbindungen von menschlichem Verhalten inspiriert wurden. Danach werden neuartige morphologie- und sentiment-basierte Paradigmen diskutiert, die menschliches Verhalten quantitativ erfassen. In Teil II werden mehrere Methoden entwickelt, die LMs durch Vergleich mit menschlichen Verhaltensmaßen bewerten. Diskutiert werden die Eignung und mögliche Störfaktoren für Offline-Bewertungen und zwei Paradigmen von Online-Lesezeiten: Eye-Tracking und G-Maze. Ein neuartiger Datensatz der G-Maze-Antwortzeiten wird dazu verwendet, um rechnerische und sprachliche Beweise für die Divergenz zu liefern.

@phdthesis{Greenberg_Diss,
title = {Evaluating humanness in language models},
author = {Clayton Greenberg},
url = {https://jahrbib.sulb.uni-saarland.de/handle/20.500.11880/37534},
doi = {https://doi.org/10.22028/D291-41943},
year = {2024},
date = {2024},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {Advances with language models, systems that predict upcoming words in context, have enabled an era in which people sometimes cannot distinguish between human-written and artificially created text. Perplexity, the simplest and most popular way to evaluate the quality of a language model, rewards any pattern captured by the system as long as it robustly constrains the upcoming possibilities. By capturing patterns that humans do not use, optimizing a language model for minimal perplexity could trigger a divergence between the most probable text and the most human-like text. In this thesis, I argue that this divergence has happened for state-of-the-art language models. Part I characterizes the kinds of knowledge captured by language models. First, I present three novel language model architectures whose neural connections were inspired by human behavior. Then, I discuss novel morphology- and sentiment-based paradigms that capture human knowledge quantitatively. Part II establishes several methods for evaluating language models by comparison against human behavior measures. I consider the suitability and potential confounds for offline ratings and two paradigms of online reading times: eye-tracking and G-Maze. Then, I use a novel dataset of G-Maze response times to show computational and linguistic evidence of the divergence.


Fortschritte bei Sprachmodellen (LMs) - Systeme, die aus dem Kontext heraus nachfolgende Worte vorhersagen - haben dazu gef{\"u}hrt, dass Menschen manchmal nicht mehr zwischen von Menschen geschriebenem und k{\"u}nstlich erzeugtem Text unterscheiden k{\"o}nnen. Perplexit{\"a}t (PPL), die einfachste und beliebteste Methode zur Bewertung der Qualit{\"a}t eines LM, belohnt jedes vom System erfasste Muster, solange es die kommenden M{\"o}glichkeiten stark einschr{\"a}nkt. Durch die Erfassung von Mustern, die Menschen nicht verwenden, k{\"o}nnte die Optimierung eines LM hinsichtlich minimaler PPL zu einer Divergenz zwischen dem wahrscheinlichsten Text und dem menschen{\"a}hnlichsten Text f{\"u}hren. In dieser Arbeit wird argumentiert, dass diese Divergenz bei modernen LMs aufgetreten ist. Teil I charakterisiert die Arten von Wissen, die von LMs erfasst werden. Zuerst werden drei neue LM-Architekturen beschreiben, deren neuronale Verbindungen von menschlichem Verhalten inspiriert wurden. Danach werden neuartige morphologie- und sentiment-basierte Paradigmen diskutiert, die menschliches Verhalten quantitativ erfassen. In Teil II werden mehrere Methoden entwickelt, die LMs durch Vergleich mit menschlichen Verhaltensma{\ss}en bewerten. Diskutiert werden die Eignung und m{\"o}gliche St{\"o}rfaktoren f{\"u}r Offline-Bewertungen und zwei Paradigmen von Online-Lesezeiten: Eye-Tracking und G-Maze. Ein neuartiger Datensatz der G-Maze-Antwortzeiten wird dazu verwendet, um rechnerische und sprachliche Beweise f{\"u}r die Divergenz zu liefern.},
pubstate = {published},
type = {phdthesis}
}

Copy BibTeX to Clipboard

Project:   B4

Zhang, Miaoran; Mingyang, Wang; Alabi, Jesujoba; Klakow, Dietrich

AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness Inproceedings

Kr. Ojha, Atul; Seza Doğruöz, A.; Tayyar Madabushi, Harish; Da San Martino, Giovanni; Rosenthal, Sara; Rosá, Aiala (Ed.): Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, pp. 800-810, Mexico City, Mexico, 2024.

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

@inproceedings{zhang2024aadamsemeval2024task1,
title = {AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness},
author = {Miaoran Zhang and Wang Mingyang and Jesujoba Alabi and Dietrich Klakow},
editor = {Atul Kr. Ojha and A. Seza Doğru{\"o}z and Harish Tayyar Madabushi and Giovanni Da San Martino and Sara Rosenthal and Aiala Ros{\'a}},
url = {https://aclanthology.org/2024.semeval-1.114},
doi = {https://doi.org/10.18653/v1/2024.semeval-1.114},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)},
pages = {800-810},
publisher = {Association for Computational Linguistics},
address = {Mexico City, Mexico},
abstract = {This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Chingacham, Anupama; Zhang, Miaoran; Demberg, Vera; Klakow, Dietrich

Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It? Inproceedings

Soni, Nikita; Flek, Lucie; Sharma, Ashish; Yang, Diyi; Hooker, Sara; Andrew Schwartz, H. (Ed.): Proceedings of the 1st Human-Centered Large Language Modeling Workshop, Association for Computational Linguistics, pp. 1-15, 2024.

Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text.However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic.We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise.Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline.Our approach resulted in a 40{\%} relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.

@inproceedings{chingacham-etal-2024-human,
title = {Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?},
author = {Anupama Chingacham and Miaoran Zhang and Vera Demberg and Dietrich Klakow},
editor = {Nikita Soni and Lucie Flek and Ashish Sharma and Diyi Yang and Sara Hooker and H. Andrew Schwartz},
url = {https://aclanthology.org/2024.hucllm-1.1/},
doi = {https://doi.org/10.18653/v1/2024.hucllm-1.1},
year = {2024},
date = {2024-08-30},
booktitle = {Proceedings of the 1st Human-Centered Large Language Modeling Workshop},
pages = {1-15},
publisher = {Association for Computational Linguistics},
abstract = {Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text.However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic.We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise.Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline.Our approach resulted in a 40{\%} relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Lin, Pin-Jie; Zhang, Miaoran; Mosbach, Marius; Klakow, Dietrich

Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning Inproceedings

Fu, Xiyan; Fleisig, Eve, Eve (Ed.): Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Association for Computational Linguistics, pp. 170-185, Bangkok, Thailand, 2024, ISBN 979-8-89176-097-4.

Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that the transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights can better estimate task transferability by improving task prediction scores from 2.59{\%} to 3.96{\%}. Despite their strong performance, we observe that the task embeddings do not consistently demonstrate superiority for tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity is better predictive for predicting transferability compared to averaging weights.

@inproceedings{lin-etal-2024-exploring,
title = {Exploring the Effectiveness and Consistency of Task Selection in Intermediate-Task Transfer Learning},
author = {Pin-Jie Lin and Miaoran Zhang and Marius Mosbach and Dietrich Klakow},
editor = {Xiyan Fu and Eve Fleisig Eve},
url = {https://aclanthology.org/2024.acl-srw.24/},
doi = {https://doi.org/10.18653/v1/2024.acl-srw.24},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)},
isbn = {979-8-89176-097-4},
pages = {170-185},
publisher = {Association for Computational Linguistics},
address = {Bangkok, Thailand},
abstract = {Identifying beneficial tasks to transfer from is a critical step toward successful intermediate-task transfer learning. In this work, we experiment with 130 source-target task combinations and demonstrate that the transfer performance exhibits severe variance across different source tasks and training seeds, highlighting the crucial role of intermediate-task selection in a broader context. We compare four representative task selection methods in a unified setup, focusing on their effectiveness and consistency. Compared to embedding-free methods and text embeddings, task embeddings constructed from fine-tuned weights can better estimate task transferability by improving task prediction scores from 2.59{\%} to 3.96{\%}. Despite their strong performance, we observe that the task embeddings do not consistently demonstrate superiority for tasks requiring reasoning abilities. Furthermore, we introduce a novel method that measures pairwise token similarity using maximum inner product search, leading to the highest performance in task prediction. Our findings suggest that token-wise similarity is better predictive for predicting transferability compared to averaging weights.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Zhang, Miaoran; Gautam, Vagrant; Mingyang, Wang; Alabi, Jesujoba; Shen, Xiaoyu; Klakow, Dietrich; Mosbach, Marius

The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis Inproceedings

Ku, Lun-Wei; Martins, Andre; Srikumar, Vivek (Ed.): Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, pp. 7342-7371, Bangkok, Thailand, 2024.

In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.

@inproceedings{zhang-etal-2024-impact,
title = {The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis},
author = {Miaoran Zhang and Vagrant Gautam and Wang Mingyang and Jesujoba Alabi and Xiaoyu Shen and Dietrich Klakow and Marius Mosbach},
editor = {Lun-Wei Ku and Andre Martins and Vivek Srikumar},
url = {https://aclanthology.org/2024.findings-acl.438/},
doi = {https://doi.org/10.18653/v1/2024.findings-acl.438},
year = {2024},
date = {2024},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2024},
pages = {7342-7371},
publisher = {Association for Computational Linguistics},
address = {Bangkok, Thailand},
abstract = {In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Mosbach, Marius

Analyzing pre-trained and fine-tuned language models PhD Thesis

Saarländische Universitäts- und Landesbibliothek, Saarland University, Saarbruecken, Germany, 2024.

The field of natural language processing (NLP) has recently undergone a paradigm shift. Since the introduction of transformer-based language models in 2018, the current generation of natural language processing models continues to demonstrate impressive capabilities on a variety of academic benchmarks and real-world applications. This paradigm shift is based on a simple but general pipeline which consists of pre-training neural language models on large quantities of text, followed by an adaptation step that fine-tunes the pre-trained model to perform a specific NLP task of interest. Despite the impressive progress on academic benchmarks and the widespread deployment of pre-trained and fine-tuned language models in industry, these models do not come without shortcomings which often have immediate consequences for the robustness and generalization of fine-tuned language models. Moreover, these shortcomings demonstrate that we still lack a fundamental understanding of how and why pre-trained and fine-tuned language models work as well as the individual steps of the pipeline that produce them. This thesis makes several contributions towards improving our understanding of pre-trained and fine-tuned language models by carrying out a detailed analysis of various parts of the modern NLP pipeline. Our contributions range from analyzing the linguistic knowledge of pre-trained language models and how it is affected by fine-tuning, to a rigorous analysis of the fine-tuning process itself and how the choice of adaptation technique affects the generalization of models. Overall, we provide new insights about previously unexplained phenomena and the capabilities of pre-trained and fine-tuned language models.


Im Bereich der Verarbeitung natürlicher Sprache (NLP) hat sich ein Paradigmenwechsel vollzogen. Seit der Einführung von transformer-basierten Sprachmodellen im Jahr 2018 zeigt die aktuelle Generation neuronaler Sprachverarbeitungsmodelle beeindruckende Fähigkeiten bei einer Vielzahl von akademischen Benchmarks und realen Anwendungen. Dieser Paradigmenwechsel basiert auf einer einfachen, aber allgemeinen Pipeline, die aus dem Vortrainieren von neuronalen Sprachmodellen auf großen Textmengen besteht, gefolgt von einem Anpassungsschritt, der das vortrainierte Modell modifiziert, um eine bestimmte NLP-Aufgabe durchzuführen. Trotz des beeindruckenden Fortschritts bei akademischen Benchmarks und des weit verbreiteten Einsatzes von vortrainierten und angepassten Sprachmodellen in der Industrie sind diese Modelle nicht ohne Mängel, und oft haben diese Mängel unmittelbare Auswirkungen auf die Robustheit und Generalisierung der Sprachmodelle. Darüber hinaus zeigen sie, dass uns einerseits noch immer ein grundlegendes Verständnis dafür fehlt, wie und warum vortrainierte und angepasste Sprachmodelle funktionieren, andererseits fehlt ein grundlegendes Verständnis der einzelnen Schritte der Pipeline. Diese Arbeit leistet mehrere Beiträge zur Verbesserung unseres Verständnisses von vortrainierten und angepassten Sprachmodellen, indem sie eine detaillierte Analyse verschiedener Teile der modernen NLP-Pipeline durchführt. Unsere Beiträge reichen von der Analyse des linguistischen Wissens von vortrainierten Sprachmodellen und wie dieses durch die Anpassung beeinflusst wird bis hin zu einer rigorosen Analyse des Anpassungsprozesses selbst und wie die Wahl der Anpassungstechnik die Generalisierung von Modellen beeinflusst, und liefern insgesamt neue Erkenntnisse über bisher unerklärte Phänomene und Fähigkeiten von vortrainierten und angepassten Sprachmodellen.

@phdthesis{Mosbach-2024-Thesis,
title = {Analyzing pre-trained and fine-tuned language models},
author = {Marius Mosbach},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/37254},
doi = {https://doi.org/10.22028/D291-41531},
year = {2024},
date = {2024-02-19},
school = {Saarland University},
publisher = {Saarl{\"a}ndische Universit{\"a}ts- und Landesbibliothek},
address = {Saarbruecken, Germany},
abstract = {The field of natural language processing (NLP) has recently undergone a paradigm shift. Since the introduction of transformer-based language models in 2018, the current generation of natural language processing models continues to demonstrate impressive capabilities on a variety of academic benchmarks and real-world applications. This paradigm shift is based on a simple but general pipeline which consists of pre-training neural language models on large quantities of text, followed by an adaptation step that fine-tunes the pre-trained model to perform a specific NLP task of interest. Despite the impressive progress on academic benchmarks and the widespread deployment of pre-trained and fine-tuned language models in industry, these models do not come without shortcomings which often have immediate consequences for the robustness and generalization of fine-tuned language models. Moreover, these shortcomings demonstrate that we still lack a fundamental understanding of how and why pre-trained and fine-tuned language models work as well as the individual steps of the pipeline that produce them. This thesis makes several contributions towards improving our understanding of pre-trained and fine-tuned language models by carrying out a detailed analysis of various parts of the modern NLP pipeline. Our contributions range from analyzing the linguistic knowledge of pre-trained language models and how it is affected by fine-tuning, to a rigorous analysis of the fine-tuning process itself and how the choice of adaptation technique affects the generalization of models. Overall, we provide new insights about previously unexplained phenomena and the capabilities of pre-trained and fine-tuned language models.


Im Bereich der Verarbeitung nat{\"u}rlicher Sprache (NLP) hat sich ein Paradigmenwechsel vollzogen. Seit der Einf{\"u}hrung von transformer-basierten Sprachmodellen im Jahr 2018 zeigt die aktuelle Generation neuronaler Sprachverarbeitungsmodelle beeindruckende F{\"a}higkeiten bei einer Vielzahl von akademischen Benchmarks und realen Anwendungen. Dieser Paradigmenwechsel basiert auf einer einfachen, aber allgemeinen Pipeline, die aus dem Vortrainieren von neuronalen Sprachmodellen auf gro{\ss}en Textmengen besteht, gefolgt von einem Anpassungsschritt, der das vortrainierte Modell modifiziert, um eine bestimmte NLP-Aufgabe durchzuf{\"u}hren. Trotz des beeindruckenden Fortschritts bei akademischen Benchmarks und des weit verbreiteten Einsatzes von vortrainierten und angepassten Sprachmodellen in der Industrie sind diese Modelle nicht ohne M{\"a}ngel, und oft haben diese M{\"a}ngel unmittelbare Auswirkungen auf die Robustheit und Generalisierung der Sprachmodelle. Dar{\"u}ber hinaus zeigen sie, dass uns einerseits noch immer ein grundlegendes Verst{\"a}ndnis daf{\"u}r fehlt, wie und warum vortrainierte und angepasste Sprachmodelle funktionieren, andererseits fehlt ein grundlegendes Verst{\"a}ndnis der einzelnen Schritte der Pipeline. Diese Arbeit leistet mehrere Beitr{\"a}ge zur Verbesserung unseres Verst{\"a}ndnisses von vortrainierten und angepassten Sprachmodellen, indem sie eine detaillierte Analyse verschiedener Teile der modernen NLP-Pipeline durchf{\"u}hrt. Unsere Beitr{\"a}ge reichen von der Analyse des linguistischen Wissens von vortrainierten Sprachmodellen und wie dieses durch die Anpassung beeinflusst wird bis hin zu einer rigorosen Analyse des Anpassungsprozesses selbst und wie die Wahl der Anpassungstechnik die Generalisierung von Modellen beeinflusst, und liefern insgesamt neue Erkenntnisse {\"u}ber bisher unerkl{\"a}rte Ph{\"a}nomene und F{\"a}higkeiten von vortrainierten und angepassten Sprachmodellen.},
pubstate = {published},
type = {phdthesis}
}

Copy BibTeX to Clipboard

Project:   B4

Steuer, Julius; Mosbach, Marius; Klakow, Dietrich

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures Inproceedings

Warstadt, Alex; Mueller, Aaron; Choshen, Leshem; Wilcox, Ethan; Zhuang, Chengxu; Ciro, Juan; Rafael, Mosquera; Paranjabe, Bhargavi; Williams, Adina; Linzen, Tal; Cotterell, Ryan (Ed.): Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Association for Computational Linguistics, pp. 142-157, Singapore, 2023.

Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus, evaluating on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks. In contrast, a negative correlation was found between LM size and reading time fit of linear mixed-effects models using LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.

@inproceedings{steuer-etal-2023-large,
title = {Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures},
author = {Julius Steuer and Marius Mosbach and Dietrich Klakow},
editor = {Alex Warstadt and Aaron Mueller and Leshem Choshen and Ethan Wilcox and Chengxu Zhuang and Juan Ciro and Mosquera Rafael and Bhargavi Paranjabe and Adina Williams and Tal Linzen and Ryan Cotterell},
url = {https://aclanthology.org/2023.conll-babylm.12/},
doi = {https://doi.org/10.18653/v1/2023.conll-babylm.12},
year = {2023},
date = {2023},
booktitle = {Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning},
pages = {142-157},
publisher = {Association for Computational Linguistics},
address = {Singapore},
abstract = {Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus, evaluating on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks. In contrast, a negative correlation was found between LM size and reading time fit of linear mixed-effects models using LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Gautam, Vagrant; Zhang, Miaoran; Klakow, Dietrich

A Lightweight Method to Generate Unanswerable Questions in English Inproceedings

Bouamor, Houda; Pino, Juan; Bali, Kalika (Ed.): Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, pp. 7349-7360, Singapore, 2023.

If a question cannot be answered with the available information, robust systems for question answering (QA) should know *not* to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.

@inproceedings{gautam-etal-2023-lightweight,
title = {A Lightweight Method to Generate Unanswerable Questions in English},
author = {Vagrant Gautam and Miaoran Zhang and Dietrich Klakow},
editor = {Houda Bouamor and Juan Pino and Kalika Bali},
url = {https://aclanthology.org/2023.findings-emnlp.491},
doi = {https://doi.org/10.18653/v1/2023.findings-emnlp.491},
year = {2023},
date = {2023},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023},
pages = {7349-7360},
publisher = {Association for Computational Linguistics},
address = {Singapore},
abstract = {If a question cannot be answered with the available information, robust systems for question answering (QA) should know *not* to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing automated approaches is not justified, we examine a simpler data augmentation method for unanswerable question generation in English: performing antonym and entity swaps on answerable questions. Compared to the prior state-of-the-art, data generated with our training-free and lightweight strategy results in better models (+1.6 F1 points on SQuAD 2.0 data with BERT-large), and has higher human-judged relatedness and readability. We quantify the raw benefits of our approach compared to no augmentation across multiple encoder models, using different amounts of generated data, and also on TydiQA-MinSpan data (+9.3 F1 points with BERT-large). Our results establish swaps as a simple but strong baseline for future work.

},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Zhu, Dawei; Shen, Xiaoyu; Mosbach, Marius; Stephan, Andreas; Klakow, Dietrich

Weaker Than You Think: A Critical Look at Weakly Supervised Learning Inproceedings

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, pp. 14229-14253, Toronto, Canada, 2023.

Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this paper, we revisit the setup of these approaches and find that the benefits brought by these approaches are significantly overestimated. Specifically, we find that the success of existing weakly supervised learning approaches heavily relies on the availability of clean validation samples which, as we show, can be leveraged much more efficiently by simply training on them. After using these clean labels in training, the advantages of using these sophisticated approaches are mostly wiped out. This remains true even when reducing the size of the available clean data to just five samples per class, making these approaches impractical. To understand the true value of weakly supervised learning, we thoroughly analyze diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work. Based on our findings, we provide recommendations for future research.

@inproceedings{zhu-etal-2023-weaker,
title = {Weaker Than You Think: A Critical Look at Weakly Supervised Learning},
author = {Dawei Zhu and Xiaoyu Shen and Marius Mosbach and Andreas Stephan and Dietrich Klakow},
url = {https://aclanthology.org/2023.acl-long.796},
doi = {https://doi.org/10.18653/v1/2023.acl-long.796},
year = {2023},
date = {2023-09-21},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages = {14229-14253},
publisher = {Association for Computational Linguistics},
address = {Toronto, Canada},
abstract = {Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this paper, we revisit the setup of these approaches and find that the benefits brought by these approaches are significantly overestimated. Specifically, we find that the success of existing weakly supervised learning approaches heavily relies on the availability of clean validation samples which, as we show, can be leveraged much more efficiently by simply training on them. After using these clean labels in training, the advantages of using these sophisticated approaches are mostly wiped out. This remains true even when reducing the size of the available clean data to just five samples per class, making these approaches impractical. To understand the true value of weakly supervised learning, we thoroughly analyze diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work. Based on our findings, we provide recommendations for future research.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Mosbach, Marius; Pimentel, Tiago; Ravfogel, Shauli; Klakow, Dietrich; Elazar, Yanai

Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation Inproceedings

Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, pp. 12284-12314, Toronto, Canada, 2023.

Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations.Unfortunately, previous comparisons of the two approaches were done using models of different sizes. This raises the question of whether the observed weaker out-of-domain generalization of fine-tuned models is an inherent property of fine-tuning or a limitation of the experimental setup. In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B. Our results show that fine-tuned language models can in fact generalize well out-of-domain. We find that both approaches generalize similarly; they exhibit large variation and depend on properties such as model size and the number of examples, highlighting that robust task adaptation remains a challenge.

@inproceedings{mosbach-etal-2023-shot,
title = {Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation},
author = {Marius Mosbach and Tiago Pimentel and Shauli Ravfogel and Dietrich Klakow and Yanai Elazar},
url = {https://aclanthology.org/2023.findings-acl.779},
doi = {https://doi.org/10.18653/v1/2023.findings-acl.779},
year = {2023},
date = {2023},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
pages = {12284-12314},
publisher = {Association for Computational Linguistics},
address = {Toronto, Canada},
abstract = {Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations.Unfortunately, previous comparisons of the two approaches were done using models of different sizes. This raises the question of whether the observed weaker out-of-domain generalization of fine-tuned models is an inherent property of fine-tuning or a limitation of the experimental setup. In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B. Our results show that fine-tuned language models can in fact generalize well out-of-domain. We find that both approaches generalize similarly; they exhibit large variation and depend on properties such as model size and the number of examples, highlighting that robust task adaptation remains a challenge.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Steuer, Julius; Abdullah, Badr M.; List, Johann-Mattis; Klakow, Dietrich

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists Inproceedings

Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 96-109, Dubrovnik, Croatia, 2023.

We present a cross-linguistic study of vowel harmony that aims to quantifies this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis on vowel harmony. On the contrary, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

@inproceedings{steuer-etal-2023-information,
title = {Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists},
author = {Julius Steuer and Badr M. Abdullah and Johann-Mattis List and Dietrich Klakow},
url = {https://aclanthology.org/2023.sigtyp-1.10},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
pages = {96-109},
publisher = {Association for Computational Linguistics},
address = {Dubrovnik, Croatia},
abstract = {We present a cross-linguistic study of vowel harmony that aims to quantifies this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis on vowel harmony. On the contrary, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Projects:   B4 C4

Hedderich, Michael

Weak supervision and label noise handling for Natural language processing in low-resource scenarios PhD Thesis

Saarland University, Saarbruecken, Germany, 2022.

The lack of large amounts of labeled data is a significant factor blocking many low-resource languages and domains from catching up with recent advancements in natural language processing. To reduce this dependency on labeled instances, weak supervision (semi-)automatically annotates unlabeled data. These labels can be obtained more quickly and cheaply than manual, gold-standard annotations. They also, however, contain more errors. Handling these noisy labels is often required to leverage the weakly supervised data successfully. In this dissertation, we study the whole weak supervision pipeline with a focus on the task of named entity recognition. We develop a tool for automatic annotation, and we propose an approach to model label noise when a small amount of clean data is available. We study the factors that influence the noise model’s quality from a theoretic perspective, and we validate this approach empirically on several different tasks and languages. An important aspect is the aim for a realistic evaluation. We perform our analysis, among others, on several African low-resource languages. We show the performance benefits that can be achieved using weak supervision and label noise modeling. But we also highlight open issues that the field still has to overcome. For the low-resource settings, we expand the analysis to few-shot learning. For classification errors, we present a novel approach to obtain interpretable insights of where classifiers fail.


Der Mangel an annotierten Daten ist ein wesentlicher Faktor, der viele Sprachen und Domänen mit geringen Ressourcen daran hindert, mit den jüngsten Fortschritten in der digitalen Textverarbeitung Schritt zu halten. Um diese Abhängigkeit von gelabelten Trainingsdaten zu verringern, werden bei Weak Supervision nicht gelabelte Daten (halb-)automatisch annotiert. Diese Annotationen sind schneller und günstiger zu erhalten. Sie enthalten jedoch auch mehr Fehler. Oft ist eine besondere Behandlung dieser Noisy Labels notwendig, um die Daten erfolgreich nutzen zu können. In dieser Dissertation untersuchen wir die gesamte Weak Supervision Pipeline mit einem Schwerpunkt auf den Einsatz für die Erkennung von Entitäten. Wir entwickeln ein Tool zur automatischen Annotation und präsentieren einen neuen Ansatz zur Modellierung von Noisy Labels. Wir untersuchen die Faktoren, die die Qualität dieses Modells aus theoretischer Sicht beeinflussen, und wir validieren den Ansatz empirisch für verschiedene Aufgaben und Sprachen. Ein wichtiger Aspekt dieser Arbeit ist das Ziel einer realistischen Analyse. Die Untersuchung führen wir unter anderem an mehreren afrikanischen Sprachen durch und zeigen die Leistungsvorteile, die durch Weak Supervision und die Modellierung von Label Noise erreicht werden können. Auch erweitern wir die Analyse auf das Lernen mit wenigen Beispielen. In Bezug auf Klassifizierungsfehler, stellen wir zudem einen neuen Ansatz vor, um interpretierbare Erkenntnisse zu gewinnen.

@phdthesis{Hedderich_Diss_2022,
title = {Weak supervision and label noise handling for Natural language processing in low-resource scenarios},
author = {Michael Hedderich},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/35026},
doi = {https://doi.org/10.22028/D291-38691},
year = {2022},
date = {2022},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {The lack of large amounts of labeled data is a significant factor blocking many low-resource languages and domains from catching up with recent advancements in natural language processing. To reduce this dependency on labeled instances, weak supervision (semi-)automatically annotates unlabeled data. These labels can be obtained more quickly and cheaply than manual, gold-standard annotations. They also, however, contain more errors. Handling these noisy labels is often required to leverage the weakly supervised data successfully. In this dissertation, we study the whole weak supervision pipeline with a focus on the task of named entity recognition. We develop a tool for automatic annotation, and we propose an approach to model label noise when a small amount of clean data is available. We study the factors that influence the noise model's quality from a theoretic perspective, and we validate this approach empirically on several different tasks and languages. An important aspect is the aim for a realistic evaluation. We perform our analysis, among others, on several African low-resource languages. We show the performance benefits that can be achieved using weak supervision and label noise modeling. But we also highlight open issues that the field still has to overcome. For the low-resource settings, we expand the analysis to few-shot learning. For classification errors, we present a novel approach to obtain interpretable insights of where classifiers fail.


Der Mangel an annotierten Daten ist ein wesentlicher Faktor, der viele Sprachen und Dom{\"a}nen mit geringen Ressourcen daran hindert, mit den j{\"u}ngsten Fortschritten in der digitalen Textverarbeitung Schritt zu halten. Um diese Abh{\"a}ngigkeit von gelabelten Trainingsdaten zu verringern, werden bei Weak Supervision nicht gelabelte Daten (halb-)automatisch annotiert. Diese Annotationen sind schneller und g{\"u}nstiger zu erhalten. Sie enthalten jedoch auch mehr Fehler. Oft ist eine besondere Behandlung dieser Noisy Labels notwendig, um die Daten erfolgreich nutzen zu k{\"o}nnen. In dieser Dissertation untersuchen wir die gesamte Weak Supervision Pipeline mit einem Schwerpunkt auf den Einsatz f{\"u}r die Erkennung von Entit{\"a}ten. Wir entwickeln ein Tool zur automatischen Annotation und pr{\"a}sentieren einen neuen Ansatz zur Modellierung von Noisy Labels. Wir untersuchen die Faktoren, die die Qualit{\"a}t dieses Modells aus theoretischer Sicht beeinflussen, und wir validieren den Ansatz empirisch f{\"u}r verschiedene Aufgaben und Sprachen. Ein wichtiger Aspekt dieser Arbeit ist das Ziel einer realistischen Analyse. Die Untersuchung f{\"u}hren wir unter anderem an mehreren afrikanischen Sprachen durch und zeigen die Leistungsvorteile, die durch Weak Supervision und die Modellierung von Label Noise erreicht werden k{\"o}nnen. Auch erweitern wir die Analyse auf das Lernen mit wenigen Beispielen. In Bezug auf Klassifizierungsfehler, stellen wir zudem einen neuen Ansatz vor, um interpretierbare Erkenntnisse zu gewinnen.},
pubstate = {published},
type = {phdthesis}
}

Copy BibTeX to Clipboard

Project:   B4

Alabi, Jesujoba; Adelani, David; Mosbach, Marius; Klakow, Dietrich

Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning Inproceedings

Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, pp. 4336-4349, Gyeongju, Republic of Korea, 2022.

Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) {—} fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning on 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that corresponds to non-African writing scripts before MAFT, thus reducing the model size by around 50{\%}. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter efficient fine-tuning methods.

@inproceedings{alabi-etal-2022-adapting,
title = {Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning},
author = {Jesujoba Alabi and David Adelani and Marius Mosbach and Dietrich Klakow},
url = {https://aclanthology.org/2022.coling-1.382},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 29th International Conference on Computational Linguistics},
pages = {4336-4349},
publisher = {International Committee on Computational Linguistics},
address = {Gyeongju, Republic of Korea},
abstract = {Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) {---} fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning on 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that corresponds to non-African writing scripts before MAFT, thus reducing the model size by around 50{\%}. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive to applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter efficient fine-tuning methods.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Zhang, Miaoran; Mosbach, Marius; Adelani, David; Hedderich, Michael; Klakow, Dietrich

MCSE: Multimodal Contrastive Learning of Sentence Embeddings Inproceedings

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 5959-5969, Seattle, United States, 2022.

Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman{‚}s correlation by 1.7{\%}. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.

@inproceedings{zhang-etal-2022-mcse,
title = {MCSE: Multimodal Contrastive Learning of Sentence Embeddings},
author = {Miaoran Zhang and Marius Mosbach and David Adelani and Michael Hedderich and Dietrich Klakow},
url = {https://aclanthology.org/2022.naacl-main.436},
doi = {https://doi.org/10.18653/v1/2022.naacl-main.436},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages = {5959-5969},
publisher = {Association for Computational Linguistics},
address = {Seattle, United States},
abstract = {Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman{'}s correlation by 1.7{\%}. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Hedderich, Michael; Fischer, Jonas; Klakow, Dietrich; Vreeken, Jilles

Label-descriptive patterns and their application to characterizing classification errors Inproceedings

International Conference on Machine Learning, PMLR, pp. 8691-8707, 2022.

State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those feature-value combinations (i.e., patterns) that strongly correlate with correct resp. erroneous predictions to obtain a global and interpretable description for arbitrary classifiers. We show this is an instance of the more general label description problem, which we formulate in terms of the Minimum Description Length principle. To discover a good pattern set, we develop the efficient Premise algorithm. Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data. Unlike existing solutions, it ably recovers ground truth patterns, even on highly imbalanced data over many features. Through two case studies on Visual Question Answering and Named Entity Recognition, we confirm that Premise gives clear and actionable insight into the systematic errors made by modern NLP classifiers.

@inproceedings{hedderich2022label,
title = {Label-descriptive patterns and their application to characterizing classification errors},
author = {Michael Hedderich and Jonas Fischer and Dietrich Klakow and Jilles Vreeken},
url = {https://arxiv.org/abs/2110.09599},
doi = {https://doi.org/10.48550/arXiv.2110.09599},
year = {2022},
date = {2022},
booktitle = {International Conference on Machine Learning},
pages = {8691-8707},
publisher = {PMLR},
abstract = {State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those feature-value combinations (i.e., patterns) that strongly correlate with correct resp. erroneous predictions to obtain a global and interpretable description for arbitrary classifiers. We show this is an instance of the more general label description problem, which we formulate in terms of the Minimum Description Length principle. To discover a good pattern set, we develop the efficient Premise algorithm. Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data. Unlike existing solutions, it ably recovers ground truth patterns, even on highly imbalanced data over many features. Through two case studies on Visual Question Answering and Named Entity Recognition, we confirm that Premise gives clear and actionable insight into the systematic errors made by modern NLP classifiers.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Zouhar, Vilém; Mosbach, Marius; Zhang, Miaoran; Klakow, Dietrich

Knowledge Base Index Compression via Dimensionality and Precision Reduction Inproceedings

Spa-NLP workshop at ACL 2022, 22nd-27th May 2022 Dublin, Ireland, 2022.

Recently neural network based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB) which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction.
Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1bit per dimension. Overall we achieve (1) 100× compression with 75%, and (2) 24× compression with 92% original retrieval performance.

@inproceedings{Zouhar_2022_Base,
title = {Knowledge Base Index Compression via Dimensionality and Precision Reduction},
author = {Vil{\'e}m Zouhar and Marius Mosbach and Miaoran Zhang and Dietrich Klakow},
url = {https://arxiv.org/abs/2204.02906},
year = {2022},
date = {2022},
publisher = {Spa-NLP workshop at ACL 2022},
address = {22nd-27th May 2022 Dublin, Ireland},
abstract = {Recently neural network based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB) which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of the KB index by means of dimensionality (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1bit per dimension. Overall we achieve (1) 100× compression with 75%, and (2) 24× compression with 92% original retrieval performance.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B4

Successfully