Dutta Chowdhury, Koel; España i Bonet, Cristina; van Genabith, Josef

Understanding Translationese in Multi-view Embedding Spaces

Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, pp. 6056-6062, Barcelona, Catalonia (Online), 2020.

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data (words, parts of speech, semantic tags and synsets) to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) while delexicalised embeddings exhibit source-language interference most significantly, other levels of abstraction display the same tendency, indicating that the lexicalised results are not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.
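
As a rough illustration of what a "departure from isomorphism" between two embedding spaces can look like in practice, the sketch below compares the Laplacian spectra of nearest-neighbour graphs built over each space. The abstract does not say which measure the paper uses, so the spectral comparison, the function names (knn_graph, laplacian_spectrum, spectral_distance), the neighbourhood size k and the random toy data are all assumptions made for this example, not the authors' method.

```python
import numpy as np


def knn_graph(emb, k=3):
    """Symmetric, unweighted k-nearest-neighbour graph from cosine similarity."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a point is never its own neighbour
    nearest = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    rows = np.repeat(np.arange(len(emb)), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)  # make the graph undirected


def laplacian_spectrum(emb, k=3):
    """Eigenvalues (ascending) of the graph Laplacian L = D - A."""
    adj = knn_graph(emb, k)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(lap)


def spectral_distance(emb_a, emb_b, k=3):
    """Sum of squared differences between the two Laplacian spectra.

    Values near 0 suggest similar neighbourhood structure; larger values
    suggest a stronger departure from isomorphism (assumed proxy).
    """
    ev_a = laplacian_spectrum(emb_a, k)
    ev_b = laplacian_spectrum(emb_b, k)
    n = min(len(ev_a), len(ev_b))
    return float(np.sum((ev_a[:n] - ev_b[:n]) ** 2))


# Toy data only: 50 random "word" vectors standing in for an embedding space.
rng = np.random.default_rng(0)
original = rng.normal(size=(50, 20))

# A rotated copy has the same cosine geometry as the original space.
rotation, _ = np.linalg.qr(rng.normal(size=(20, 20)))
rotated = original @ rotation

# An unrelated random space stands in for a structurally different one.
unrelated = rng.normal(size=(50, 20))

dist_rotated = spectral_distance(original, rotated)
dist_unrelated = spectral_distance(original, unrelated)
```

In a setting like the one the abstract describes, the two inputs would be embeddings trained on original target-language text and on translations into that language, and the pairwise distances across source languages could then be fed to a clustering step. That pipeline is only outlined here, not reproduced.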