Publications

Dutta Chowdhury, Koel; Jalota, Rricha; van Genabith, Josef; España i Bonet, Cristina

Towards Debiasing Translation Artifacts Inproceedings Forthcoming

NAACL 2022 DFKI & SFB 1102, Seattle, Washington, 2022.

@inproceedings{Chowdhury_2022_Debiasing,
title = {Towards Debiasing Translation Artifacts},
author = {Koel Dutta Chowdhury and Rricha Jalota and Josef van Genabith and Cristina Espa{\~n}a i Bonet},
url = {https://2022.naacl.org/},
year = {2022},
date = {2022},
publisher = {NAACL 2022 DFKI \& SFB 1102},
address = {Seattle, Washington},
pubstate = {forthcoming},
type = {inproceedings}
}

Project:   B6

Amponsah-Kaakyire, Kwabena; Pylypenko, Daria; España i Bonet, Cristina; van Genabith, Josef

Do not Rely on Relay Translations: Multilingual Parallel Direct Europarl Inproceedings

Proceedings of the Workshop on Modelling Translation: Translatology in the Digital Age (MoTra21), International Committee on Computational Linguistics, pp. 1-7, Iceland (Online), 2021.

Translationese data is a scarce and valuable resource. Traditionally, the proceedings of the European Parliament have been used for studying translationese phenomena since their metadata allows to distinguish between original and translated texts. However, translations are not always direct and we hypothesise that a pivot (also called “relay”) language might alter the conclusions on translationese effects. In this work, we (i) isolate translations that have been done without an intermediate language in the Europarl proceedings from those that might have used a pivot language, and (ii) build comparable and parallel corpora with data aligned across multiple languages that therefore can be used for both machine translation and translation studies.

@inproceedings{AmposahEtal:MOTRA:2021,
title = {Do not Rely on Relay Translations: Multilingual Parallel Direct Europarl},
author = {Kwabena Amponsah-Kaakyire and Daria Pylypenko and Cristina Espa{\~n}a i Bonet and Josef van Genabith},
url = {https://aclanthology.org/2021.motra-1.1/},
year = {2021},
date = {2021},
booktitle = {Proceedings of the Workshop on Modelling Translation: Translatology in the Digital Age (MoTra21)},
pages = {1--7},
publisher = {International Committee on Computational Linguistics},
address = {Iceland (Online)},
abstract = {Translationese data is a scarce and valuable resource. Traditionally, the proceedings of the European Parliament have been used for studying translationese phenomena since their metadata allows to distinguish between original and translated texts. However, translations are not always direct and we hypothesise that a pivot (also called “relay”) language might alter the conclusions on translationese effects. In this work, we (i) isolate translations that have been done without an intermediate language in the Europarl proceedings from those that might have used a pivot language, and (ii) build comparable and parallel corpora with data aligned across multiple languages that therefore can be used for both machine translation and translation studies.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Pylypenko, Daria; Amponsah-Kaakyire, Kwabena; Dutta Chowdhury, Koel; van Genabith, Josef; España i Bonet, Cristina

Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification Inproceedings Forthcoming

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online and in the Dominican Republic, 2021.

@inproceedings{Pylypenko2021comparing,
title = {Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification},
author = {Daria Pylypenko and Kwabena Amponsah-Kaakyire and Koel Dutta Chowdhury and Josef van Genabith and Cristina Espa{\~n}a i Bonet},
year = {2021},
date = {2021},
booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
address = {Online and in the Dominican Republic},
pubstate = {forthcoming},
type = {inproceedings}
}

Project:   B6

Dutta Chowdhury, Koel; España i Bonet, Cristina; van Genabith, Josef

Tracing Source Language Interference in Translation with Graph-Isomorphism Measures Inproceedings

Proceedings of Recent Advances in Natural Language Processing (RANLP 2021), pp. 380-390, Online, 2021, ISSN 2603-2813.

Previous research has used linguistic features to show that translations exhibit traces of source language interference and that phylogenetic trees between languages can be reconstructed from the results of translations into the same language. Recent research has shown that instances of translationese (source language interference) can even be detected in embedding spaces, comparing embeddings spaces of original language data with embedding spaces resulting from translations into the same language, using a simple Eigenvector-based divergence from isomorphism measure. To date, it remains an open question whether alternative graph-isomorphism measures can produce better results. In this paper, we (i) explore Gromov-Hausdorff distance, (ii) present a novel spectral version of the Eigenvector-based method, and (iii) evaluate all approaches against a broad linguistic typological database (URIEL). We show that language distances resulting from our spectral isomorphism approaches can reproduce genetic trees on a par with previous work without requiring any explicit linguistic information and that the results can be extended to non-Indo-European languages. Finally, we show that the methods are robust under a variety of modeling conditions.

@inproceedings{Chowdhury2021tracing,
title = {Tracing Source Language Interference in Translation with Graph-Isomorphism Measures},
author = {Koel Dutta Chowdhury and Cristina Espa{\~n}a i Bonet and Josef van Genabith},
url = {https://aclanthology.org/2021.ranlp-1.43/},
year = {2021},
date = {2021},
booktitle = {Proceedings of Recent Advances in Natural Language Processing (RANLP 2021)},
issn = {2603-2813},
pages = {380--390},
address = {Online},
abstract = {Previous research has used linguistic features to show that translations exhibit traces of source language interference and that phylogenetic trees between languages can be reconstructed from the results of translations into the same language. Recent research has shown that instances of translationese (source language interference) can even be detected in embedding spaces, comparing embeddings spaces of original language data with embedding spaces resulting from translations into the same language, using a simple Eigenvector-based divergence from isomorphism measure. To date, it remains an open question whether alternative graph-isomorphism measures can produce better results. In this paper, we (i) explore Gromov-Hausdorff distance, (ii) present a novel spectral version of the Eigenvector-based method, and (iii) evaluate all approaches against a broad linguistic typological database (URIEL). We show that language distances resulting from our spectral isomorphism approaches can reproduce genetic trees on a par with previous work without requiring any explicit linguistic information and that the results can be extended to non-Indo-European languages. Finally, we show that the methods are robust under a variety of modeling conditions.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Bizzoni, Yuri; Juzek, Tom; España i Bonet, Cristina; Dutta Chowdhury, Koel; van Genabith, Josef; Teich, Elke

How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech Inproceedings

The 17th International Workshop on Spoken Language Translation, Seattle, WA, United States, 2020.

Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons.

In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).

@inproceedings{Bizzoni2020,
title = {How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech},
author = {Yuri Bizzoni and Tom Juzek and Cristina Espa{\~n}a i Bonet and Koel Dutta Chowdhury and Josef van Genabith and Elke Teich},
year = {2020},
date = {2020},
booktitle = {The 17th International Workshop on Spoken Language Translation},
address = {Seattle, WA, United States},
abstract = {Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).},
pubstate = {published},
type = {inproceedings}
}

Projects:   B6 B7

Dutta Chowdhury, Koel; España i Bonet, Cristina; van Genabith, Josef

Understanding Translationese in Multi-view Embedding Spaces Inproceedings

Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, pp. 6056-6062, Barcelona, Catalonia (Online), 2020.

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data (words, parts of speech, semantic tags and synsets) to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.

@inproceedings{DuttaEtal:COLING:2020,
title = {Understanding Translationese in Multi-view Embedding Spaces},
author = {Koel Dutta Chowdhury and Cristina Espa{\~n}a i Bonet and Josef van Genabith},
url = {https://www.aclweb.org/anthology/2020.coling-main.532/},
doi = {10.18653/v1/2020.coling-main.532},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
pages = {6056--6062},
publisher = {International Committee on Computational Linguistics},
address = {Barcelona, Catalonia (Online)},
abstract = {Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data {---} words, parts of speech, semantic tags and synsets {---} to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not “just” due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

van Genabith, Josef; España i Bonet, Cristina; Lapshinova-Koltunski, Ekaterina

Analysing Coreference in Transformer Outputs Inproceedings

Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), Association for Computational Linguistics, pp. 1-12, Hong Kong, China, 2019.

We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

@inproceedings{lapshinovaEtal:2019iscoMT,
title = {Analysing Coreference in Transformer Outputs},
author = {Josef van Genabith and Cristina Espa{\~n}a i Bonet and Ekaterina Lapshinova-Koltunski},
url = {https://www.aclweb.org/anthology/D19-6501},
doi = {10.18653/v1/D19-6501},
year = {2019},
date = {2019},
booktitle = {Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)},
pages = {1--12},
publisher = {Association for Computational Linguistics},
address = {Hong Kong, China},
abstract = {We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Rubino, Raphael; Degaetano-Ortlieb, Stefania; Teich, Elke; van Genabith, Josef

Modeling Diachronic Change in Scientific Writing with Information Density Inproceedings

Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee, pp. 750-761, 2016.

@inproceedings{C16-1072,
title = {Modeling Diachronic Change in Scientific Writing with Information Density},
author = {Raphael Rubino and Stefania Degaetano-Ortlieb and Elke Teich and Josef van Genabith},
url = {http://aclweb.org/anthology/C16-1072},
year = {2016},
date = {2016},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers},
pages = {750--761},
publisher = {The COLING 2016 Organizing Committee},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Rubino, Raphael; Lapshinova-Koltunski, Ekaterina; van Genabith, Josef

Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification Inproceedings

Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, pp. 960-970, 2016.

@inproceedings{N16-1110,
title = {Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification},
author = {Raphael Rubino and Ekaterina Lapshinova-Koltunski and Josef van Genabith},
url = {http://aclweb.org/anthology/N16-1110},
doi = {10.18653/v1/N16-1110},
year = {2016},
date = {2016},
booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
pages = {960--970},
publisher = {Association for Computational Linguistics},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Findings of the 2016 Conference on Machine Translation Inproceedings

Proceedings of the First Conference on Machine Translation, Association for Computational Linguistics, pp. 131-198, Berlin, Germany, 2016.

@inproceedings{bojar-EtAl:2016:WMT1,
title = {Findings of the 2016 Conference on Machine Translation},
author = {Ond{\v{r}}ej Bojar and Rajen Chatterjee and Christian Federmann and Yvette Graham and Barry Haddow and Matthias Huck and Antonio Jimeno Yepes and Philipp Koehn and Varvara Logacheva and Christof Monz and Matteo Negri and Aurelie Neveol and Mariana Neves and Martin Popel and Matt Post and Raphael Rubino and Carolina Scarton and Lucia Specia and Marco Turchi and Karin Verspoor and Marcos Zampieri},
url = {http://www.aclweb.org/anthology/W/W16/W16-2301},
year = {2016},
date = {2016-08-01},
booktitle = {Proceedings of the First Conference on Machine Translation},
pages = {131--198},
publisher = {Association for Computational Linguistics},
address = {Berlin, Germany},
pubstate = {published},
type = {inproceedings}
}

Project:   B6
