Publications

Mosbach, Marius; Andriushchenko, Maksym; Klakow, Dietrich

On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines Inproceedings

International Conference on Learning Representations, 2021.

Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches.
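
To make the baseline concrete, here is a minimal sketch of the kind of fine-tuning setup the paper advocates: bias-corrected Adam, a small learning rate with warmup, substantially more training epochs than the usual two or three, and repetition over several random seeds. It uses the PyTorch and Hugging Face transformers APIs; the specific hyperparameter values, the model name, and the dummy step count are illustrative assumptions, not the paper's exact configuration.

# Hedged sketch: a stability-oriented BERT fine-tuning setup (illustrative values).
import torch
from torch.optim import AdamW  # applies Adam bias correction by default
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          get_linear_schedule_with_warmup, set_seed)

def build_finetuning_setup(seed, num_training_steps, lr=2e-5, warmup_frac=0.1):
    """Return (model, tokenizer, optimizer, scheduler) for one random seed."""
    set_seed(seed)  # control one source of run-to-run variance
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * num_training_steps),
        num_training_steps=num_training_steps)
    return model, tokenizer, optimizer, scheduler

# Training for many epochs and repeating the run over several seeds makes the
# variance of the final task score visible (and, per the paper, small).
for seed in (0, 1, 2, 3, 4):
    model, tokenizer, optimizer, scheduler = build_finetuning_setup(
        seed, num_training_steps=20 * 250)  # e.g. 20 epochs of 250 steps each
    # ... standard training loop over a small GLUE task would go here ...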

@inproceedings{mosbach2021on,
title = {On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines},
author = {Marius Mosbach and Maksym Andriushchenko and Dietrich Klakow},
url = {https://openreview.net/forum?id=nzpLWnVAyah},
year = {2021},
date = {2021},
booktitle = {International Conference on Learning Representations},
abstract = {Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches.},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Abdullah, Badr M.; Mosbach, Marius; Zaitova, Iuliia; Möbius, Bernd; Klakow, Dietrich

Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study Inproceedings

Proceedings of Interspeech 2021, 2021.

Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.
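
As an aside for readers who want to reproduce the second evaluation described above, the sketch below correlates pairwise distances in an embedding space with phonological distances using Spearman's rank correlation. The word list, the random embeddings, and the hand-assigned phonological distances are made-up stand-ins; the paper's models, languages, and distance measures are the real thing.

# Hedged sketch: does distance in an AWE space track phonological distance?
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

# Toy stand-ins: random "acoustic word embeddings" and hand-assigned
# phonological distances for a few German word pairs (values are invented).
rng = np.random.default_rng(0)
awes = {w: rng.normal(size=64) for w in ("haus", "maus", "baum", "traum")}
phon_dist = {("haus", "maus"): 0.25, ("haus", "baum"): 0.75,
             ("haus", "traum"): 0.80, ("maus", "baum"): 0.75,
             ("maus", "traum"): 0.80, ("baum", "traum"): 0.20}

emb_d = [cosine(awes[a], awes[b]) for a, b in phon_dist]
pho_d = [phon_dist[pair] for pair in phon_dist]

# A high rank correlation would mean nearness in the embedding space mirrors
# phonological similarity; the paper finds the correlation is moderate at best.
rho, _ = spearmanr(emb_d, pho_d)
print(f"Spearman rho between AWE and phonological distances: {rho:.3f}")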

@inproceedings{Abdullah2021DoAW,
title = {Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study},
author = {Badr M. Abdullah and Marius Mosbach and Iuliia Zaitova and Bernd M{\"o}bius and Dietrich Klakow},
url = {https://arxiv.org/abs/2106.08686},
year = {2021},
date = {2021},
booktitle = {Proceedings of Interspeech 2021},
abstract = {Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.},
pubstate = {published},
type = {inproceedings}
}


Projects:   C4 B4

Mosbach, Marius; Degaetano-Ortlieb, Stefania; Krielke, Marie-Pauline; Abdullah, Badr M.; Klakow, Dietrich

A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English Inproceedings

Proceedings of the 28th International Conference on Computational Linguistics, pp. 771-787, 2020.

Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a) model comparison in evaluation task and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.
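
A masked prediction diagnostic of the kind evaluated above can be run as a fill-mask query: mask the verb that follows a relative clause and check whether the model prefers the form agreeing with the correct antecedent. The example sentence and the use of the Hugging Face fill-mask pipeline are illustrative assumptions; they are not the paper's actual test items or evaluation code.

# Hedged sketch: a toy masked-prediction check for agreement across a
# relative clause (illustrative sentence, not from the paper's dataset).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The masked verb must agree with the plural head "the authors", not with
# the singular noun "the book" inside the relative clause.
sentence = "The authors that wrote the book [MASK] famous."

for candidate in fill_mask(sentence, targets=["are", "is"]):
    print(f"{candidate['token_str']:>4s}  score = {candidate['score']:.4f}")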

@inproceedings{Mosbach2020,
title = {A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English},
author = {Marius Mosbach and Stefania Degaetano-Ortlieb and Marie-Pauline Krielke and Badr M. Abdullah and Dietrich Klakow},
url = {https://www.aclweb.org/anthology/2020.coling-main.67.pdf},
year = {2020},
date = {2020-12-01},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
pages = {771-787},
abstract = {Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a) model comparison in evaluation task and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.},
pubstate = {published},
type = {inproceedings}
}


Projects:   B1 B4 C4

Adelani, David; Hedderich, Michael; Zhu, Dawei; van den Berg, Esther; Klakow, Dietrich

Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá Miscellaneous

ArXiv, abs/2003.08370, 2020.

The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way.

Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios.

In this work, we perform named entity recognition for Hausa and Yorùbá, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier’s performance.
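
To illustrate what distant supervision can look like in practice, the sketch below produces (noisy) NER labels by matching tokens against small entity lists and emitting BIO tags. The gazetteer entries and the example sentence are invented for illustration; the paper's annotation sources, embeddings, and noise-handling models are considerably richer.

# Hedged sketch: gazetteer-based distant supervision for NER (BIO tagging).
# Entity lists and the example sentence are illustrative, not from the paper.

GAZETTEER = {
    "PER": [["muhammadu", "buhari"]],
    "LOC": [["abuja"], ["lagos"]],
    "ORG": [["bbc", "hausa"]],
}

def distant_labels(tokens):
    """Greedy longest-match lookup against the gazetteer; 'O' elsewhere."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        match = None
        for label, entries in GAZETTEER.items():
            for entry in entries:
                if lowered[i:i + len(entry)] == entry:
                    if match is None or len(entry) > len(match[1]):
                        match = (label, entry)
        if match:
            label, entry = match
            tags[i] = f"B-{label}"
            for k in range(1, len(entry)):
                tags[i + k] = f"I-{label}"
            i += len(entry)
        else:
            i += 1
    return tags

tokens = "BBC Hausa ta ruwaito cewa Muhammadu Buhari ya isa Abuja".split()
print(list(zip(tokens, distant_labels(tokens))))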

@miscellaneous{Adelani2020,
title = {Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùb{\'a}},
author = {David Adelani and Michael Hedderich and Dawei Zhu and Esther van den Berg and Dietrich Klakow},
url = {https://arxiv.org/abs/2003.08370},
year = {2020},
date = {2020},
booktitle = {ArXiv},
abstract = {The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way. Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios. In this work, we perform named entity recognition for Hausa and Yorùb{\'a}, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier's performance.},
pubstate = {published},
type = {miscellaneous}
}


Project:   B4

Hedderich, Michael; Adelani, David; Zhu, Dawei; Alabi, Jesujoba; Udia, Markus; Klakow, Dietrich

Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages Inproceedings

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 2580-2591, 2020.

Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.

@inproceedings{hedderich-etal-2020-transfer,
title = {Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages},
author = {Michael Hedderich and David Adelani and Dawei Zhu and Jesujoba Alabi and Markus Udia and Dietrich Klakow},
url = {https://www.aclweb.org/anthology/2020.emnlp-main.204},
doi = {https://doi.org/10.18653/v1/2020.emnlp-main.204},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
pages = {2580-2591},
publisher = {Association for Computational Linguistics},
abstract = {Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and Yorùbá on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Mosbach, Marius; Khokhlova, Anna; Hedderich, Michael; Klakow, Dietrich

On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers Inproceedings

Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, pp. 2502-2516, 2020.

Fine-tuning pre-trained contextualized embedding models has become an integral part of the NLP pipeline. At the same time, probing has emerged as a way to investigate the linguistic knowledge captured by pre-trained models. Very little is, however, understood about how fine-tuning affects the representations of pre-trained models and thereby the linguistic knowledge they encode. This paper contributes towards closing this gap. We study three different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate through sentence-level probing how fine-tuning affects their representations. We find that for some probing tasks fine-tuning leads to substantial changes in accuracy, possibly suggesting that fine-tuning introduces or even removes linguistic knowledge from a pre-trained model. These changes, however, vary greatly across different models, fine-tuning and probing tasks. Our analysis reveals that while fine-tuning indeed changes the representations of a pre-trained model and these changes are typically larger for higher layers, only in very few cases, fine-tuning has a positive effect on probing accuracy that is larger than just using the pre-trained model with a strong pooling method. Based on our findings, we argue that both positive and negative effects of fine-tuning on probing require a careful interpretation.
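
Sentence-level probing of the kind used in the paper can be approximated by freezing an encoder, pooling its token representations into sentence vectors, and fitting a light classifier on top. The sketch below uses mean pooling and scikit-learn's logistic regression on a toy task; the probing tasks, datasets, and pooling strategies in the paper differ.

# Hedged sketch: sentence-level probing with a frozen encoder and a linear probe.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()  # frozen: we only read out representations

@torch.no_grad()
def sentence_vectors(sentences):
    """Mean-pool the final-layer token representations into sentence vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # zero out padding positions
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy probing task (illustrative): does the sentence contain a relative clause?
train_sents = ["The book that I read was long.", "The book was long.",
               "The man who left smiled.", "The man smiled."]
train_labels = [1, 0, 1, 0]

probe = LogisticRegression(max_iter=1000)
probe.fit(sentence_vectors(train_sents), train_labels)
print(probe.predict(sentence_vectors(["The dog that barked ran away."])))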

@inproceedings{mosbach-etal-2020-interplay-fine,
title = {On the Interplay Between Fine-tuning and Sentence-level Probing for Linguistic Knowledge in Pre-trained Transformers},
author = {Marius Mosbach and Anna Khokhlova and Michael Hedderich and Dietrich Klakow},
url = {https://www.aclweb.org/anthology/2020.findings-emnlp.227},
doi = {https://doi.org/10.18653/v1/2020.findings-emnlp.227},
year = {2020},
date = {2020},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
pages = {2502-2516},
publisher = {Association for Computational Linguistics},
abstract = {Fine-tuning pre-trained contextualized embedding models has become an integral part of the NLP pipeline. At the same time, probing has emerged as a way to investigate the linguistic knowledge captured by pre-trained models. Very little is, however, understood about how fine-tuning affects the representations of pre-trained models and thereby the linguistic knowledge they encode. This paper contributes towards closing this gap. We study three different pre-trained models: BERT, RoBERTa, and ALBERT, and investigate through sentence-level probing how fine-tuning affects their representations. We find that for some probing tasks fine-tuning leads to substantial changes in accuracy, possibly suggesting that fine-tuning introduces or even removes linguistic knowledge from a pre-trained model. These changes, however, vary greatly across different models, fine-tuning and probing tasks. Our analysis reveals that while fine-tuning indeed changes the representations of a pre-trained model and these changes are typically larger for higher layers, only in very few cases, fine-tuning has a positive effect on probing accuracy that is larger than just using the pre-trained model with a strong pooling method. Based on our findings, we argue that both positive and negative effects of fine-tuning on probing require a careful interpretation.},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Biswas, Rajarshi; Mogadala, Aditya; Barz, Michael; Sonntag, Daniel; Klakow, Dietrich

Automatic Judgement of Neural Network-Generated Image Captions Inproceedings

7th International Conference on Statistical Language and Speech Processing (SLSP2019), Ljubljana, Slovenia, 2019.

@inproceedings{Biswas2019,
title = {Automatic Judgement of Neural Network-Generated Image Captions},
author = {Rajarshi Biswas and Aditya Mogadala and Michael Barz and Daniel Sonntag and Dietrich Klakow},
year = {2019},
date = {2019},
booktitle = {7th International Conference on Statistical Language and Speech Processing (SLSP2019)},
address = {Ljubljana, Slovenia},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Lange, Lukas; Hedderich, Michael; Klakow, Dietrich

Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels Inproceedings

Inui, Kentaro; Jiang, Jing; Ng, Vincent; Wan, Xiaojun (Ed.): Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, pp. 3552-3557, Hong Kong, China, 2019.

In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous works have shown that significant improvements can be reached by injecting information about the confusion between clean and noisy labels in this additional training data into the classifier training. However, for noise estimation, these approaches either do not take the input features (in our case word embeddings) into account, or they need to learn the noise modeling from scratch which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute different confusion matrices for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix based methods by up to 9%.
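
To make the clustering idea concrete, the sketch below groups tokens by their (here randomly generated stand-in) embeddings with k-means and estimates one confusion matrix between clean and noisy labels per cluster. It is a minimal sketch on toy data, not the paper's full noise model or its initialization scheme.

# Hedged sketch: feature-dependent (per-cluster) label confusion matrices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
NUM_LABELS, NUM_CLUSTERS, N = 3, 2, 200

# Toy stand-ins: token embeddings, clean labels, and distantly supervised
# (noisy) labels. Real inputs would come from a small annotated corpus.
embeddings = rng.normal(size=(N, 50))
clean = rng.integers(0, NUM_LABELS, size=N)
flip = rng.random(N) < 0.3                      # ~30% of noisy labels are corrupted
noisy = np.where(flip, rng.integers(0, NUM_LABELS, size=N), clean)

clusters = KMeans(n_clusters=NUM_CLUSTERS, n_init=10, random_state=0).fit_predict(embeddings)

# One row-normalized confusion matrix P(noisy label | clean label) per cluster;
# such matrices could then initialize a noise layer on top of the NER classifier.
for c in range(NUM_CLUSTERS):
    idx = clusters == c
    matrix = np.zeros((NUM_LABELS, NUM_LABELS))
    np.add.at(matrix, (clean[idx], noisy[idx]), 1)
    matrix /= matrix.sum(axis=1, keepdims=True).clip(min=1)
    print(f"cluster {c}:\n{matrix.round(2)}")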

@inproceedings{lange-etal-2019-feature,
title = {Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels},
author = {Lukas Lange and Michael Hedderich and Dietrich Klakow},
editor = {Kentaro Inui and Jing Jiang and Vincent Ng and Xiaojun Wan},
url = {https://aclanthology.org/D19-1362/},
doi = {https://doi.org/10.18653/v1/D19-1362},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages = {3552-3557},
publisher = {Association for Computational Linguistics},
address = {Hong Kong, China},
abstract = {In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous works have shown that significant improvements can be reached by injecting information about the confusion between clean and noisy labels in this additional training data into the classifier training. However, for noise estimation, these approaches either do not take the input features (in our case word embeddings) into account, or they need to learn the noise modeling from scratch which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute different confusion matrices for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix based methods by up to 9%.},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Mosbach, Marius; Stenger, Irina; Avgustinova, Tania; Klakow, Dietrich

incom.py - A Toolbox for Calculating Linguistic Distances and Asymmetries between Related Languages Inproceedings

Angelova, Galia; Mitkov, Ruslan; Nikolova, Ivelina; Temnikova, Irina (Ed.): Proceedings of Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, 2-4 September 2019, pp. 811-819, Varna, Bulgaria, 2019.

Languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we introduce incom.py, a toolbox for calculating linguistic distances and asymmetries between related languages. incom.py allows linguist experts to quickly and easily perform statistical analyses and compare those with experimental results. We demonstrate the efficacy of incom.py in an intercomprehension experiment on two Slavic languages: Bulgarian and Russian. Using incom.py we were able to validate three methods to measure linguistic distances and asymmetries: Levenshtein distance, word adaptation surprisal, and conditional entropy as predictors of success in a reading intercomprehension experiment.
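
Of the three measures named in the abstract, Levenshtein distance is the easiest to reproduce; the sketch below computes a length-normalized version for a few Bulgarian-Russian cognate pairs. The word pairs and the normalization by the longer word are illustrative assumptions and not necessarily incom.py's exact implementation.

# Hedged sketch: normalized Levenshtein distance between related word forms.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]

def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))

# A few Bulgarian-Russian word pairs (orthographic, for illustration only).
for bg, ru in [("мляко", "молоко"), ("хляб", "хлеб"), ("вода", "вода")]:
    print(f"{bg} ~ {ru}: {normalized_levenshtein(bg, ru):.2f}")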

@inproceedings{Mosbach2019,
title = {incom.py - A Toolbox for Calculating Linguistic Distances and Asymmetries between Related Languages},
author = {Marius Mosbach and Irina Stenger and Tania Avgustinova and Dietrich Klakow},
editor = {Galia Angelova and Ruslan Mitkov and Ivelina Nikolova and Irina Temnikova},
url = {https://aclanthology.org/R19-1094/},
doi = {https://doi.org/10.26615/978-954-452-056-4_094},
year = {2019},
date = {2019},
booktitle = {Proceedings of Recent Advances in Natural Language Processing, RANLP 2019, Varna, Bulgaria, 2-4 September 2019},
pages = {811-819},
address = {Varna, Bulgaria},
abstract = {Languages may be differently distant from each other and their mutual intelligibility may be asymmetric. In this paper we introduce incom.py, a toolbox for calculating linguistic distances and asymmetries between related languages. incom.py allows linguist experts to quickly and easily perform statistical analyses and compare those with experimental results. We demonstrate the efficacy of incom.py in an intercomprehension experiment on two Slavic languages: Bulgarian and Russian. Using incom.py we were able to validate three methods to measure linguistic distances and asymmetries: Levenshtein distance, word adaptation surprisal, and conditional entropy as predictors of success in a reading intercomprehension experiment.},
pubstate = {published},
type = {inproceedings}
}


Projects:   B4 C4

Grosse, Kathrin; Trost, Thomas; Mosbach, Marius; Backes, Michael; Klakow, Dietrich

On the security relevance of weights in deep learning Journal Article

arXiv, Cornell University, 2019.

@article{Grosse2019,
title = {On the security relevance of weights in deep learning},
author = {Kathrin Grosse and Thomas Trost and Marius Mosbach and Michael Backes and Dietrich Klakow},
url = {https://arxiv.org/abs/1902.03020},
year = {2019},
date = {2019-02-08},
journal = {arXiv, Cornell University},
pubstate = {published},
type = {article}
}


Project:   B4

Oualil, Youssef

Sequential estimation techniques and application to multiple speaker tracking and language modeling PhD Thesis

Saarland University, Saarbruecken, Germany, 2017.

For many real-world applications, the considered data is given as a time sequence that becomes available in an orderly fashion, where the order incorporates important information about the entities of interest. The work presented in this thesis deals with two such cases by introducing new sequential estimation solutions. More precisely, we introduce a: I. Sequential Bayesian estimation framework to solve the multiple speaker localization, detection and tracking problem. This framework is a complete pipeline that includes 1) new observation estimators, which extract a fixed number of potential locations per time frame; 2) new unsupervised Bayesian detectors, which classify these estimates into noise/speaker classes and 3) new Bayesian filters, which use the speaker class estimates to track multiple speakers.

This framework was developed to tackle the low overlap detection rate of multiple speakers and to reduce the number of constraints generally imposed in standard solutions. II. Sequential neural estimation framework for language modeling, which overcomes some of the shortcomings of standard approaches through merging of different models in a hybrid architecture. That is, we introduce two solutions that tightly merge particular models and then show how a generalization can be achieved through a new mixture model. In order to speed-up the training of large vocabulary language models, we introduce a new extension of the noise contrastive estimation approach to batch training.

@phdthesis{Oualil2017b,
title = {Sequential estimation techniques and application to multiple speaker tracking and language modeling},
author = {Youssef Oualil},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291-scidok-ds-272280},
doi = {https://doi.org/http://dx.doi.org/10.22028/D291-27228},
year = {2017},
date = {2017},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {For many real-world applications, the considered data is given as a time sequence that becomes available in an orderly fashion, where the order incorporates important information about the entities of interest. The work presented in this thesis deals with two such cases by introducing new sequential estimation solutions. More precisely, we introduce a: I. Sequential Bayesian estimation framework to solve the multiple speaker localization, detection and tracking problem. This framework is a complete pipeline that includes 1) new observation estimators, which extract a fixed number of potential locations per time frame; 2) new unsupervised Bayesian detectors, which classify these estimates into noise/speaker classes and 3) new Bayesian filters, which use the speaker class estimates to track multiple speakers. This framework was developed to tackle the low overlap detection rate of multiple speakers and to reduce the number of constraints generally imposed in standard solutions. II. Sequential neural estimation framework for language modeling, which overcomes some of the shortcomings of standard approaches through merging of different models in a hybrid architecture. That is, we introduce two solutions that tightly merge particular models and then show how a generalization can be achieved through a new mixture model. In order to speed-up the training of large vocabulary language models, we introduce a new extension of the noise contrastive estimation approach to batch training.},
pubstate = {published},
type = {phdthesis}
}


Project:   B4

Singh, Mittul; Oualil, Youssef; Klakow, Dietrich

Approximated and domain-adapted LSTM language models for first-pass decoding in speech recognition Inproceedings

18th Annual Conference of the International Speech Communication Association (INTERSPEECH), Stockholm, Sweden, 2017.

@inproceedings{Singh2017,
title = {Approximated and domain-adapted LSTM language models for first-pass decoding in speech recognition},
author = {Mittul Singh and Youssef Oualil and Dietrich Klakow},
year = {2017},
date = {2017},
booktitle = {18th Annual Conference of the International Speech Communication Association (INTERSPEECH)},
address = {Stockholm, Sweden},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Klakow, Dietrich; Trost, Thomas

Parameter Free Hierarchical Graph-Based Clustering for Analyzing Continuous Word Embeddings Inproceedings

Workshop Proceedings of TextGraphs-11: Graph-based Methods for Natural Language Processing (Workshop at ACL 2017), 2017.

@inproceedings{TroKla2017,
title = {Parameter Free Hierarchical Graph-Based Clustering for Analyzing Continuous Word Embeddings},
author = {Dietrich Klakow and Thomas Trost},
year = {2017},
date = {2017},
booktitle = {Workshop Proceedings of TextGraphs-11: Graph-based Methods for Natural Language Processing (Workshop at ACL 2017)},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Oualil, Youssef; Klakow, Dietrich

A batch noise contrastive estimation approach for training large vocabulary language models Inproceedings

18th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017.

@inproceedings{Oualil2017,
title = {A batch noise contrastive estimation approach for training large vocabulary language models},
author = {Youssef Oualil and Dietrich Klakow},
year = {2017},
date = {2017},
booktitle = {18th Annual Conference of the International Speech Communication Association (INTERSPEECH)},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Singh, Mittul; Greenberg, Clayton; Oualil, Youssef; Klakow, Dietrich

Sub-Word Similarity based Search for Embeddings: Inducing Rare-Word Embeddings for Word Similarity Tasks and Language Modelling Inproceedings

Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, Osaka, Japan, 2016.

Training good word embeddings requires large amounts of data. Out-of-vocabulary words will still be encountered at test-time, leaving these words without embeddings. To overcome this lack of embeddings for rare words, existing methods leverage morphological features to generate embeddings. While the existing methods use computationally-intensive rule-based (Soricut and Och, 2015) or tool-based (Botha and Blunsom, 2014) morphological analysis to generate embeddings, our system applies a computationally-simpler sub-word search on words that have existing embeddings.

Embeddings of the sub-word search results are then combined using string similarity functions to generate rare word embeddings. We augmented pre-trained word embeddings with these novel embeddings and evaluated on a rare word similarity task, obtaining up to 3 times improvement in correlation over the original set of embeddings. Applying our technique to embeddings trained on larger datasets led to on-par performance with the existing state-of-the-art for this task. Additionally, while analysing augmented embeddings in a log-bilinear language model, we observed up to 50% reduction in rare word perplexity in comparison to other more complex language models.
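
The sketch below illustrates the general idea: search the existing vocabulary for words that share sub-word material with an out-of-vocabulary word, then average their embeddings weighted by a string-similarity score. The toy vocabulary, the character trigram search, and the difflib similarity function are assumptions for illustration; the paper's search and combination functions differ in detail.

# Hedged sketch: induce an embedding for a rare/OOV word from sub-word
# neighbours that already have embeddings (toy vocabulary, random vectors).
from difflib import SequenceMatcher
import numpy as np

rng = np.random.default_rng(0)
vocab_embeddings = {w: rng.normal(size=50)
                    for w in ("unhappy", "unhappily", "happiness", "sad", "table")}

def char_ngrams(word, n=3):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def induce_embedding(oov_word, embeddings, n=3):
    """Similarity-weighted average of embeddings of sub-word search hits."""
    oov_grams = char_ngrams(oov_word, n)
    weighted, total = np.zeros(50), 0.0
    for word, vector in embeddings.items():
        if oov_grams & char_ngrams(word, n):             # sub-word search hit
            weight = SequenceMatcher(None, oov_word, word).ratio()
            weighted += weight * vector
            total += weight
    return weighted / total if total > 0 else None

vec = induce_embedding("unhappiness", vocab_embeddings)
print(None if vec is None else vec[:5].round(3))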

@inproceedings{singh-EtAl:2016:COLING1,
title = {Sub-Word Similarity based Search for Embeddings: Inducing Rare-Word Embeddings for Word Similarity Tasks and Language Modelling},
author = {Mittul Singh and Clayton Greenberg and Youssef Oualil and Dietrich Klakow},
url = {http://aclweb.org/anthology/C16-1194},
year = {2016},
date = {2016-12-01},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
publisher = {The COLING 2016 Organizing Committee},
address = {Osaka, Japan},
abstract = {Training good word embeddings requires large amounts of data. Out-of-vocabulary words will still be encountered at test-time, leaving these words without embeddings. To overcome this lack of embeddings for rare words, existing methods leverage morphological features to generate embeddings. While the existing methods use computationally-intensive rule-based (Soricut and Och, 2015) or tool-based (Botha and Blunsom, 2014) morphological analysis to generate embeddings, our system applies a computationally-simpler sub-word search on words that have existing embeddings. Embeddings of the sub-word search results are then combined using string similarity functions to generate rare word embeddings. We augmented pre-trained word embeddings with these novel embeddings and evaluated on a rare word similarity task, obtaining up to 3 times improvement in correlation over the original set of embeddings. Applying our technique to embeddings trained on larger datasets led to on-par performance with the existing state-of-the-art for this task. Additionally, while analysing augmented embeddings in a log-bilinear language model, we observed up to 50% reduction in rare word perplexity in comparison to other more complex language models.},
pubstate = {published},
type = {inproceedings}
}


Project:   B4

Singh, Mittul; Greenberg, Clayton; Klakow, Dietrich

The Custom Decay Language Model for Long Range Dependencies Book Chapter

Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings, Springer International Publishing, pp. 343-351, Cham, 2016, ISBN 978-3-319-45510-5.

@inbook{Singh2016,
title = {The Custom Decay Language Model for Long Range Dependencies},
author = {Mittul Singh and Clayton Greenberg and Dietrich Klakow},
url = {http://dx.doi.org/10.1007/978-3-319-45510-5_39},
doi = {https://doi.org/10.1007/978-3-319-45510-5_39},
year = {2016},
date = {2016},
booktitle = {Text, Speech, and Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings},
isbn = {978-3-319-45510-5},
pages = {343-351},
publisher = {Springer International Publishing},
address = {Cham},
pubstate = {published},
type = {inbook}
}


Project:   B4

Oualil, Youssef; Greenberg, Clayton; Singh, Mittul; Klakow, Dietrich

Sequential recurrent neural networks for language modeling Journal Article

Interspeech 2016, pp. 3509-3513, 2016.

@article{oualil2016sequential,
title = {Sequential recurrent neural networks for language modeling},
author = {Youssef Oualil and Clayton Greenberg and Mittul Singh and Dietrich Klakow},
year = {2016},
date = {2016},
journal = {Interspeech 2016},
pages = {3509-3513},
pubstate = {published},
type = {article}
}


Project:   B4

Sayeed, Asad; Greenberg, Clayton; Demberg, Vera

Thematic fit evaluation: an aspect of selectional preferences Journal Article

Proceedings of the 1st Workshop on Evaluating Vector Space Representations for NLP, pp. 99-105, 2016, ISBN 9781945626142.

@article{Sayeed2016,
title = {Thematic fit evaluation: an aspect of selectional preferences},
author = {Asad Sayeed and Clayton Greenberg and Vera Demberg},
year = {2016},
date = {2016},
journal = {Proceedings of the 1st Workshop on Evaluating Vector Space Representations for NLP},
pages = {99-105},
pubstate = {published},
type = {article}
}


Projects:   B2 B4

Oualil, Youssef; Singh, Mittul; Greenberg, Clayton; Klakow, Dietrich

Long-short range context neural networks for language models Inproceedings

EMNLP 2016, 2016.

@inproceedings{Oualil2016,
title = {Long-short range context neural networks for language models},
author = {Youssef Oualil and Mittul Singh and Clayton Greenberg and Dietrich Klakow},
year = {2016},
date = {2016},
booktitle = {EMNLP 2016},
pubstate = {published},
type = {inproceedings}
}


Project:   B4
