Unravelling Linguistic Knowledge via Multilingual Embedding Spaces and Latent Information
Embeddings (monolingual, multilingual, static, contextualized) are the workhorses of modern language technologies. They are based on the distributional hypothesis and can capture semantic, grammatical, morphological and other information. Most embeddings are now prediction-based (Mikolov et al., 2013). Embeddings can be computed at the word, sub-word, sentence, paragraph or document level, and training them requires (very) large amounts of data. To date, most monolingual embeddings have been built for English and other resource-rich languages; similarly, multilingual embeddings usually involve English. Multilingual embeddings are especially promising (Devlin et al., 2019): word translations lie close together in multilingual embedding spaces, as do sentence translations in sentence embedding spaces, and the models allow fine-tuning as well as few- and zero-shot learning. Multilingual embeddings can be computed in a joint space or through alignment of monolingual spaces, and their final quality strongly depends on the languages and domains involved (Søgaard et al., 2018).
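As an illustration of the alignment route mentioned above, the following is a minimal sketch of orthogonal Procrustes alignment between two embedding matrices, with the relative residual after alignment as a crude indicator of departure from isomorphism. It is not the method used in the cited work: the function names, the toy data and the choice of residual are assumptions made for illustration only.

```python
import numpy as np


def procrustes_align(X, Y):
    """Return the orthogonal matrix W minimising ||X @ W - Y||_F."""
    # The SVD of the cross-covariance matrix gives the closed-form solution.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def alignment_residual(X, Y, W):
    """Relative Frobenius residual after alignment; 0 means a perfect fit."""
    return np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)


# Toy data (hypothetical): 50 "word vectors" in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

# Y_iso is X under a random rotation, so the two spaces are isomorphic.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y_iso = X @ Q

# Y_noisy adds noise, so no orthogonal map can align the spaces exactly.
Y_noisy = Y_iso + 0.5 * rng.normal(size=Y_iso.shape)

W_iso = procrustes_align(X, Y_iso)
W_noisy = procrustes_align(X, Y_noisy)
res_iso = alignment_residual(X, Y_iso, W_iso)
res_noisy = alignment_residual(X, Y_noisy, W_noisy)
```

In this toy setting the residual is numerically zero for the rotated copy and clearly larger for the noisy one. With real embeddings, the rows of X and Y would have to be paired through a seed dictionary or a comparable supervision signal.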
Mono- and multilingual embeddings constitute the core technology underpinning our previous work on translationese in B6 phase II. Our research in phase II showed for the first time that (i) departures from isomorphism between simple monolingual word embedding spaces computed from original and translated material allow us to detect translationese effects, to the extent that we can estimate phylogenetic trees between the source languages of the translations (Dutta Chowdhury et al., 2020, 2021); (ii) feature- and representation-learning approaches systematically outperform hand-crafted and linguistically inspired feature-engineering-based approaches on translationese classification (Pylypenko et al., 2021); and (iii) our feature- and representation-learning-based cross- and multilingual classification experiments provide empirical evidence of cross-language translationese universals (Pylypenko et al., 2021). Ranking individual hand-crafted features by the R² of linear classifiers trained to predict the output of the best-performing BERT model shows that language-model-based average surprisal (perplexity) features account for significant parts of the variance of the neural model.
For the new B6 proposal our research goals are foundational as well as practical: building on B6 phase II, we extend our research on multilinguality and translationese, addressing theoretical as well as practical questions about (i) information spreading in embedding spaces, (ii) capturing translationese subspaces and (iii) extracting latent background knowledge from bilingual data. We seek to apply answers to the foundational questions to improve NLP applications, including in particular NLP for low-resource languages, machine translation and perhaps even general multilingual technologies. From a foundational point of view, we focus on what is captured by multilingual embeddings: what patterns are manifest in embedding data? Can we detect patterns (clusters) with and without linguistic labels? How do they compare? Do clusters that emerge naturally in embedding space (without linguistic labels) correspond to linguistic typology? Where and why do they differ across languages? How do we capture situations where isomorphism between embedding spaces does not and should not hold? Can we identify, compute and use translationese subspaces? How can we automatically capture and quantify latent background knowledge from translations? Answers to these questions may support applications: to achieve optimal results in general as well as for low-resource multilingual models, should we cluster languages that pattern in a similar way (as in e.g. “cardinality-based” MT)? Which applications benefit from clustering? Can clustering optimize few- or zero-shot learning? Can properties of multilingual embedding spaces lead to better lexicon induction for self-supervised and unsupervised machine translation in low-resource scenarios? Can translationese subspaces improve machine translation? Can capturing latent cultural background knowledge from translations and general multilingual data reduce the perplexity of language models?
Importantly, we will explore to what extent our findings can be modelled or explained in terms of the information-theoretic concepts that take centre stage in the CRC, including entropy and surprisal: can clustering in multilingual embedding spaces be usefully described in terms of entropy? To what extent do applications of translationese subspaces register in terms of increased or decreased surprisal in translation output? To what extent can latent background knowledge be used to gain an improved notion of surprisal for the results of a translation? Our proposal targets Focus Area (3) ‘Language typology, multilinguality, language change’ of phase III of the CRC.
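For readers less familiar with these quantities, the sketch below spells out the standard definitions of surprisal, Shannon entropy and perplexity in bits. The function names and example probabilities are hypothetical; in practice the per-token probabilities would come from a trained language model rather than being given by hand.

```python
import math


def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)


def entropy(dist):
    """Shannon entropy in bits: the expected surprisal under the distribution."""
    return sum(p * surprisal(p) for p in dist.values() if p > 0)


def perplexity(token_probs):
    """Perplexity of a sequence: 2 raised to the average per-token surprisal."""
    avg_surprisal = sum(surprisal(p) for p in token_probs) / len(token_probs)
    return 2 ** avg_surprisal


# Hypothetical per-token probabilities assigned by some language model.
probs = [0.25, 0.25]
ppl = perplexity(probs)

# A fair coin has one bit of entropy.
coin_entropy = entropy({"heads": 0.5, "tails": 0.5})
```

Average surprisal and perplexity are thus two views of the same quantity: perplexity is the exponentiated average surprisal, so lowering one lowers the other.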
Keywords: machine translation