Publications

Kunilovskaya, Maria; Dutta Chowdhury, Koel; Przybyl, Heike; España-Bonet, Cristina; van Genabith, Josef

Mitigating Translationese with GPT-4: Strategies and Performance Inproceedings

Proceedings of the 25th Annual Conference of the European Association for Machine Translation, 1, European Association for Machine Translation, pp. 411–430, 2024.

Translations differ in systematic ways from texts originally authored in the same language. These differences, collectively known as translationese, can pose challenges in cross-lingual natural language processing: models trained or tested on translated input might struggle when presented with non-translated language. Translationese mitigation can alleviate this problem. This study investigates the generative capacities of GPT-4 to reduce translationese in human-translated texts. The task is framed as a rewriting process aimed at modified translations indistinguishable from the original text in the target language. Our focus is on prompt engineering that tests the utility of linguistic knowledge as part of the instruction for GPT-4. Through a series of prompt design experiments, we show that GPT4-generated revisions are more similar to originals in the target language when the prompts incorporate specific linguistic instructions instead of relying solely on the model’s internal knowledge. Furthermore, we release the segment-aligned bidirectional German–English data built from the Europarl corpus that underpins this study.

@inproceedings{kunilovskaya-etal-2024-mitigating,
title = {Mitigating Translationese with GPT-4: Strategies and Performance},
author = {Maria Kunilovskaya and Koel Dutta Chowdhury and Heike Przybyl and Cristina Espa{\~n}a-Bonet and Josef van Genabith},
url = {https://eamt2024.github.io/proceedings/vol1.pdf},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 25th Annual Conference of the European Association for Machine Translation},
pages = {411--430},
publisher = {European Association for Machine Translation},
abstract = {Translations differ in systematic ways from texts originally authored in the same language. These differences, collectively known as translationese, can pose challenges in cross-lingual natural language processing: models trained or tested on translated input might struggle when presented with non-translated language. Translationese mitigation can alleviate this problem. This study investigates the generative capacities of GPT-4 to reduce translationese in human-translated texts. The task is framed as a rewriting process aimed at modified translations indistinguishable from the original text in the target language. Our focus is on prompt engineering that tests the utility of linguistic knowledge as part of the instruction for GPT-4. Through a series of prompt design experiments, we show that GPT4-generated revisions are more similar to originals in the target language when the prompts incorporate specific linguistic instructions instead of relying solely on the model’s internal knowledge. Furthermore, we release the segment-aligned bidirectional German–English data built from the Europarl corpus that underpins this study.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B6 B7

Fischer, Stefan; Haidarzhyi, Kateryna; Knappen, Jörg; Polishchuk, Olha; Stodolinska, Yuliya; Teich, Elke

A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication Inproceedings

Romanyshyn, Mariana; Romanyshyn, Nataliia; Hlybovets, Andrii; Ignatenko, Oleksii (Ed.): Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, ELRA and ICCL, pp. 1-7, Torino, Italia, 2024.

We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbrücken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.

@inproceedings{fischer-etal-2024-contemporary,
title = {A Contemporary News Corpus of Ukrainian (CNC-UA): Compilation, Annotation, Publication},
author = {Stefan Fischer and Kateryna Haidarzhyi and J{\"o}rg Knappen and Olha Polishchuk and Yuliya Stodolinska and Elke Teich},
editor = {Mariana Romanyshyn and Nataliia Romanyshyn and Andrii Hlybovets and Oleksii Ignatenko},
url = {https://aclanthology.org/2024.unlp-1.1},
year = {2024},
date = {2024},
booktitle = {Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024},
pages = {1-7},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {We present a corpus of contemporary Ukrainian news articles published between 2019 and 2022 on the news website of the national public broadcaster of Ukraine, commonly known as SUSPILNE. The current release comprises 87 210 364 words in 292 955 texts. Texts are annotated with titles and their time of publication. In addition, the corpus has been linguistically annotated at the token level with a dependency parser. To provide further aspects for investigation, a topic model was trained on the corpus. The corpus is hosted (Fischer et al., 2023) at the Saarbr{\"u}cken CLARIN center under a CC BY-NC-ND 4.0 license and available in two tab-separated formats: CoNLL-U (de Marneffe et al., 2021) and vertical text format (VRT) as used by the IMS Open Corpus Workbench (CWB; Evert and Hardie, 2011) and CQPweb (Hardie, 2012). We show examples of using the CQPweb interface, which allows to extract the quantitative data necessary for distributional and collocation analyses of the CNC-UA. As the CNC-UA contains news texts documenting recent events, it is highly relevant not only for linguistic analyses of the modern Ukrainian language but also for socio-cultural and political studies.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Menzel, Katrin

Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data - A Case Study on Initialisms in English and German Inproceedings

Fiser, Darja; Eskevich, Maria; Bordon, David (Ed.): Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024, ELRA and ICCL, pp. 57-65, Torino, Italia, 2024.

This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.

@inproceedings{menzel-2024-exploring,
title = {Exploring Word Formation Trends in Written, Spoken, Translated and Interpreted European Parliament Data - A Case Study on Initialisms in English and German},
author = {Katrin Menzel},
editor = {Darja Fiser and Maria Eskevich and David Bordon},
url = {https://aclanthology.org/2024.parlaclarin-1.9},
year = {2024},
date = {2024},
booktitle = {Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024},
pages = {57-65},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {This paper demonstrates the research potential of a unique European Parliament dataset for register studies, contrastive linguistics, translation and interpreting studies. The dataset consists of parallel data for several European languages, including written source texts and their translations as well as spoken source texts and the transcripts of their simultaneously interpreted versions. The paper presents a cross-linguistic, corpus-based case study on a word formation phenomenon in these European Parliament data that are enriched with various linguistic annotations and metadata as well as with information-theoretic surprisal scores. It addresses the questions of how initialisms are used across languages and production modes in the English and German corpus sections of these European Parliament data, whether there is a correlation between the use of initialisms and the use of their corresponding multiword full forms in the analysed corpus sections and what insights on the informativity and possible processing difficulties of initialisms we can gain from an analysis of information-theoretic surprisal values. The results show that English written originals and German translations are the corpus sections with the highest frequencies of initialisms. The majority of cross-language transfer situations lead to fewer initialisms in the target texts than in the source texts. In the English data, there is a positive correlation between the frequency of initialisms and the frequency of the respective full forms. There is a similar correlation in the German data, apart from the interpreted data. Additionally, the results show that initialisms represent peaks of information with regard to their surprisal values within their segments. 
Particularly the German data show higher surprisal values of initialisms in mediated language than in non-mediated discourse types, which indicates that in German mediated discourse, initialisms tend to be used in less conventionalised textual contexts than in English.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Kunilovskaya, Maria; Przybyl, Heike; Lapshinova-Koltunski, Ekaterina; Teich, Elke

Simultaneous Interpreting as a Noisy Channel: How Much Information Gets Through Inproceedings

Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, pp. 608–618, Varna, Bulgaria, 2023.

We explore the relationship between information density/surprisal of source and target texts in translation and interpreting in the language pair English-German, looking at the specific properties of translation (“translationese”). Our data comes from two bidirectional English-German subcorpora representing written and spoken mediation modes collected from European Parliament proceedings. Within each language, we (a) compare original speeches to their translated or interpreted counterparts, and (b) explore the association between segment-aligned sources and targets in each translation direction. As additional variables, we consider source delivery mode (read-out, impromptu) and source speech rate in interpreting. We use language modelling to measure the information rendered by words in a segment and to characterise the cross-lingual transfer of information under various conditions. Our approach is based on statistical analyses of surprisal values, extracted from n-gram models of our dataset. The analysis reveals that while there is a considerable positive correlation between the average surprisal of source and target segments in both modes, information output in interpreting is lower than in translation, given the same amount of input. Significantly lower information density in spoken mediated production compared to non-mediated speech in the same language can indicate a possible simplification effect in interpreting.

@inproceedings{kunilovskaya-etal-2023,
title = {Simultaneous Interpreting as a Noisy Channel: How Much Information Gets Through},
author = {Maria Kunilovskaya and Heike Przybyl and Ekaterina Lapshinova-Koltunski and Elke Teich},
url = {https://aclanthology.org/2023.ranlp-1.66/},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing},
pages = {608--618},
publisher = {INCOMA Ltd., Shoumen, Bulgaria},
address = {Varna, Bulgaria},
abstract = {We explore the relationship between information density/surprisal of source and target texts in translation and interpreting in the language pair English-German, looking at the specific properties of translation (“translationese”). Our data comes from two bidirectional English-German subcorpora representing written and spoken mediation modes collected from European Parliament proceedings. Within each language, we (a) compare original speeches to their translated or interpreted counterparts, and (b) explore the association between segment-aligned sources and targets in each translation direction. As additional variables, we consider source delivery mode (read-out, impromptu) and source speech rate in interpreting. We use language modelling to measure the information rendered by words in a segment and to characterise the cross-lingual transfer of information under various conditions. Our approach is based on statistical analyses of surprisal values, extracted from n-gram models of our dataset. The analysis reveals that while there is a considerable positive correlation between the average surprisal of source and target segments in both modes, information output in interpreting is lower than in translation, given the same amount of input. Significantly lower information density in spoken mediated production compared to non-mediated speech in the same language can indicate a possible simplification effect in interpreting.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Yung, Frances Pik Yu; Scholman, Merel; Lapshinova-Koltunski, Ekaterina; Pollkläsener, Christina; Demberg, Vera

Investigating Explicitation of Discourse Connectives in Translation Using Automatic Annotations Inproceedings

Stoyanchev, Svetlana; Joty, Shafiq; Schlangen, David; Dusek, Ondrej; Kennington, Casey; Alikhani, Malihe (Ed.): Proceedings of the 24th Meeting of Special Interest Group on Discourse and Dialogue (SIGDIAL), Association for Computational Linguistics, pp. 21-30, Prague, Czechia, 2023.

Discourse relations have different patterns of marking across different languages. As a result, discourse connectives are often added, omitted, or rephrased in translation. Prior work has shown a tendency for explicitation of discourse connectives, but such work was conducted using restricted sample sizes due to difficulty of connective identification and alignment. The current study exploits automatic methods to facilitate a large-scale study of connectives in English and German parallel texts. Our results based on over 300 types and 18000 instances of aligned connectives and an empirical approach to compare the cross-lingual specificity gap provide strong evidence of the Explicitation Hypothesis. We conclude that discourse relations are indeed more explicit in translation than texts written originally in the same language. Automatic annotations allow us to carry out translation studies of discourse relations on a large scale. Our methodology using relative entropy to study the specificity of connectives also provides more fine-grained insights into translation patterns.

@inproceedings{yung-etal-2023-investigating,
title = {Investigating Explicitation of Discourse Connectives in Translation Using Automatic Annotations},
author = {Frances Pik Yu Yung and Merel Scholman and Ekaterina Lapshinova-Koltunski and Christina Pollkl{\"a}sener and Vera Demberg},
editor = {Svetlana Stoyanchev and Shafiq Joty and David Schlangen and Ondrej Dusek and Casey Kennington and Malihe Alikhani},
url = {https://aclanthology.org/2023.sigdial-1.2},
doi = {https://doi.org/10.18653/v1/2023.sigdial-1.2},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 24th Meeting of Special Interest Group on Discourse and Dialogue (SIGDIAL)},
pages = {21-30},
publisher = {Association for Computational Linguistics},
address = {Prague, Czechia},
abstract = {Discourse relations have different patterns of marking across different languages. As a result, discourse connectives are often added, omitted, or rephrased in translation. Prior work has shown a tendency for explicitation of discourse connectives, but such work was conducted using restricted sample sizes due to difficulty of connective identification and alignment. The current study exploits automatic methods to facilitate a large-scale study of connectives in English and German parallel texts. Our results based on over 300 types and 18000 instances of aligned connectives and an empirical approach to compare the cross-lingual specificity gap provide strong evidence of the Explicitation Hypothesis. We conclude that discourse relations are indeed more explicit in translation than texts written originally in the same language. Automatic annotations allow us to carry out translation studies of discourse relations on a large scale. Our methodology using relative entropy to study the specificity of connectives also provides more fine-grained insights into translation patterns.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B2 B7

Przybyl, Heike; Karakanta, Alina; Menzel, Katrin; Teich, Elke

Exploring linguistic variation in mediated discourse: translation vs. interpreting Book Chapter

Kajzer-Wietrzny, Marta; Bernardini, Silvia; Ferraresi, Adriano; Ivaska, Ilmari (Ed.): Mediated discourse at the European Parliament: Empirical investigations, Language Science Press, pp. 191–218, Berlin, 2022.

This paper focuses on the distinctive features of translated and interpreted texts in specific language combinations as forms of mediated discourse at the European Parliament. We aim to contribute to the long line of research on the specific properties of translation/interpreting. Specifically, we are interested in mediation effects (translation vs. interpreting) vs. effects of discourse mode (written vs. spoken). We propose a data-driven, exploratory approach to detecting and evaluating linguistic features as typical of translation/interpreting. Our approach utilizes simple word-based n-gram language models combined with the information-theoretic measure of relative entropy, a standard measure of similarity/difference between probability distributions, applied here as a method of corpus comparison. Comparing translation and interpreting (including the relation to their originals), we confirm the previously observed overall trend of written vs. spoken mode being strongly reflected in the translation and interpreting output. In addition, we detect some new features, such as a tendency towards more general lexemes in the verbal domain in interpreting or features of nominal style in translation.

@inbook{Przybyl2021exploring,
title = {Exploring linguistic variation in mediated discourse: translation vs. interpreting},
author = {Heike Przybyl and Alina Karakanta and Katrin Menzel and Elke Teich},
editor = {Marta Kajzer-Wietrzny and Silvia Bernardini and Adriano Ferraresi and Ilmari Ivaska},
url = {https://langsci-press.org/catalog/book/343},
doi = {https://doi.org/10.5281/zenodo.6977050},
year = {2022},
date = {2022},
booktitle = {Mediated discourse at the European Parliament: Empirical investigations},
pages = {191--218},
publisher = {Language Science Press},
address = {Berlin},
abstract = {This paper focuses on the distinctive features of translated and interpreted texts in specific language combinations as forms of mediated discourse at the European Parliament. We aim to contribute to the long line of research on the specific properties of translation/interpreting. Specifically, we are interested in mediation effects (translation vs. interpreting) vs. effects of discourse mode (written vs. spoken). We propose a data-driven, exploratory approach to detecting and evaluating linguistic features as typical of translation/interpreting. Our approach utilizes simple word-based n-gram language models combined with the information-theoretic measure of relative entropy, a standard measure of similarity/difference between probability distributions, applied here as a method of corpus comparison. Comparing translation and interpreting (including the relation to their originals), we confirm the previously observed overall trend of written vs. spoken mode being strongly reflected in the translation and interpreting output. In addition, we detect some new features, such as a tendency towards more general lexemes in the verbal domain in interpreting or features of nominal style in translation.},
pubstate = {published},
type = {inbook}
}

Project:   B7

Lapshinova-Koltunski, Ekaterina; Pollkläsener, Christina; Przybyl, Heike

Exploring Explicitation and Implicitation in Parallel Interpreting and Translation Corpora Journal Article

The Prague Bulletin of Mathematical Linguistics, 119, pp. 5-22, 2022, ISSN 0032-6585.

We present a study of discourse connectives in English-German and German-English translation and interpreting where we focus on the phenomena of explicitation and implicitation. Apart from distributional analysis of translation patterns in parallel data, we also look into surprisal, i.e. an information-theoretic measure of cognitive effort, which helps us to interpret the observed tendencies.

@article{lapshinova-koltunski-pollklaesener-przybyl:2022,
title = {Exploring Explicitation and Implicitation in Parallel Interpreting and Translation Corpora},
author = {Ekaterina Lapshinova-Koltunski and Christina Pollkl{\"a}sener and Heike Przybyl},
url = {https://ufal.mff.cuni.cz/pbml/119/art-lapshinova-koltunski-pollklaesener-przybyl.pdf},
doi = {https://doi.org/10.14712/00326585.020},
year = {2022},
date = {2022},
journal = {The Prague Bulletin of Mathematical Linguistics},
pages = {5-22},
volume = {119},
abstract = {We present a study of discourse connectives in English-German and German-English translation and interpreting where we focus on the phenomena of explicitation and implicitation. Apart from distributional analysis of translation patterns in parallel data, we also look into surprisal, i.e. an information-theoretic measure of cognitive effort, which helps us to interpret the observed tendencies.},
pubstate = {published},
type = {article}
}

Project:   B7

Przybyl, Heike; Lapshinova-Koltunski, Ekaterina; Menzel, Katrin; Fischer, Stefan; Teich, Elke

EPIC UdS - Creation and applications of a simultaneous interpreting corpus Inproceedings

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 1193–1200, Marseille, France, 20-25 June 2022.

In this paper, we describe the creation and annotation of EPIC UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the comparable and parallel, aligned corpus variants and explore various applications of the corpus. What makes EPIC UdS relevant is that it is one of the rare interpreting corpora that includes transcripts suitable for research on more than one language pair and on interpreting with regard to German. It not only contains transcribed speeches, but also rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields.

@inproceedings{Przybyl_interpreting_2022,
title = {EPIC UdS - Creation and applications of a simultaneous interpreting corpus},
author = {Heike Przybyl and Ekaterina Lapshinova-Koltunski and Katrin Menzel and Stefan Fischer and Elke Teich},
url = {https://aclanthology.org/2022.lrec-1.127/},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)},
pages = {1193--1200},
address = {Marseille, France, 20-25 June 2022},
abstract = {In this paper, we describe the creation and annotation of EPIC UdS, a multilingual corpus of simultaneous interpreting for English, German and Spanish. We give an overview of the comparable and parallel, aligned corpus variants and explore various applications of the corpus. What makes EPIC UdS relevant is that it is one of the rare interpreting corpora that includes transcripts suitable for research on more than one language pair and on interpreting with regard to German. It not only contains transcribed speeches, but also rich metadata and fine-grained linguistic annotations tailored for diverse applications across a broad range of linguistic subfields.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age Proceeding

Bizzoni, Yuri; Teich, Elke; España-Bonet, Cristina; van Genabith, Josef (Ed.): Association for Computational Linguistics, online, 2021.

@proceedings{motra-2021-modelling,
title = {Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age},
editor = {Yuri Bizzoni and Elke Teich and Cristina Espa{\~n}a-Bonet and Josef van Genabith},
url = {https://aclanthology.org/2021.motra-1.0/},
year = {2021},
date = {2021},
publisher = {Association for Computational Linguistics},
address = {online},
pubstate = {published},
type = {proceeding}
}

Project:   B7

Karakanta, Alina; Przybyl, Heike; Teich, Elke

Exploring variation in translation with probabilistic language models Incollection

Lavid-López, Julia; Maíz-Arévalo, Carmen; Zamorano-Mansilla, Juan Rafael (Ed.): Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations, 158, Benjamins, pp. 308-323, Amsterdam, 2021.

While some authors have suggested that translationese fingerprints are universal, others have shown that there is a fair amount of variation among translations due to source language shining through, translation type or translation mode. In our work, we attempt to gain empirical insights into variation in translation, focusing here on translation mode (translation vs. interpreting). Our goal is to discover features of translationese and interpretese that distinguish translated and interpreted output from comparable original text/speech as well as from each other at different linguistic levels. We use relative entropy (Kullback-Leibler Divergence) and visualization with word clouds. Our analysis shows differences in typical words between originals vs. non-originals as well as between translation modes both at lexical and grammatical levels.

@incollection{KarakantaEtAl2021,
title = {Exploring variation in translation with probabilistic language models},
author = {Alina Karakanta and Heike Przybyl and Elke Teich},
editor = {Julia Lavid-L{\'o}pez and Carmen Ma{\'i}z-Ar{\'e}valo and Juan Rafael Zamorano-Mansilla},
url = {https://doi.org/10.1075/btl.158.12kar},
doi = {https://doi.org/10.1075/btl.158.12kar},
year = {2021},
date = {2021},
booktitle = {Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations},
pages = {308-323},
publisher = {Benjamins},
address = {Amsterdam},
abstract = {While some authors have suggested that translationese fingerprints are universal, others have shown that there is a fair amount of variation among translations due to source language shining through, translation type or translation mode. In our work, we attempt to gain empirical insights into variation in translation, focusing here on translation mode (translation vs. interpreting). Our goal is to discover features of translationese and interpretese that distinguish translated and interpreted output from comparable original text/speech as well as from each other at different linguistic levels. We use relative entropy (Kullback-Leibler Divergence) and visualization with word clouds. Our analysis shows differences in typical words between originals vs. non-originals as well as between translation modes both at lexical and grammatical levels.},
pubstate = {published},
type = {incollection}
}

Project:   B7

Lapshinova-Koltunski, Ekaterina; Przybyl, Heike; Bizzoni, Yuri

Tracing variation in discourse connectives in translation and interpreting through neural semantic spaces Inproceedings

Proceedings of the 2nd Workshop on Computational Approaches to Discourse CODI, pp. 134-142, Punta Cana, Dominican Republic and Online, 2021.

In the present paper, we explore lexical contexts of discourse markers in translation and interpreting on the basis of word embeddings. Our special interest is on contextual variation of the same discourse markers in (written) translation vs. (simultaneous) interpreting. To explore this variation at the lexical level, we use a data-driven approach: we compare bilingual neural word embeddings trained on source-to-translation and source-to-interpreting aligned corpora. Our results show more variation of semantically related items in translation spaces vs. interpreting ones and a more consistent use of fewer connectives in interpreting. We also observe different trends with regard to the discourse relation types.

@inproceedings{LapshinovaEtAl2021codi,
title = {Tracing variation in discourse connectives in translation and interpreting through neural semantic spaces},
author = {Ekaterina Lapshinova-Koltunski and Heike Przybyl and Yuri Bizzoni},
url = {https://aclanthology.org/2021.codi-main.13/},
year = {2021},
date = {2021},
booktitle = {Proceedings of the 2nd Workshop on Computational Approaches to Discourse CODI},
pages = {134-142},
address = {Punta Cana, Dominican Republic and Online},
abstract = {In the present paper, we explore lexical contexts of discourse markers in translation and interpreting on the basis of word embeddings. Our special interest is on contextual variation of the same discourse markers in (written) translation vs. (simultaneous) interpreting. To explore this variation at the lexical level, we use a data-driven approach: we compare bilingual neural word embeddings trained on source-to-translation and source-to-interpreting aligned corpora. Our results show more variation of semantically related items in translation spaces vs. interpreting ones and a more consistent use of fewer connectives in interpreting. We also observe different trends with regard to the discourse relation types.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Bizzoni, Yuri; Lapshinova-Koltunski, Ekaterina

Measuring Translationese across Levels of Expertise: Are Professionals more Surprising than Students? Inproceedings

Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Linköping University Electronic Press, Sweden, pp. 53-63, 2021.

The present paper deals with a computational analysis of translationese in professional and student English-to-German translations belonging to different registers. Building upon an information-theoretical approach, we test translation conformity to source and target language in terms of a neural language model’s perplexity over Part of Speech (PoS) sequences. Our primary focus is on register diversification vs. convergence, reflected in the use of constructions eliciting a higher vs. lower perplexity score. Our results show that, against our expectations, professional translations elicit higher perplexity scores from a target language model than students’ translations. An analysis of the distribution of PoS patterns across registers shows that this apparent paradox is the effect of higher stylistic diversification and register sensitivity in professional translations. Our results contribute to the understanding of human translationese and shed light on the variation in texts generated by different translators, which is valuable for translation studies, multilingual language processing, and machine translation.

@inproceedings{Bizzoni2021,
title = {Measuring Translationese across Levels of Expertise: Are Professionals more Surprising than Students?},
author = {Yuri Bizzoni and Ekaterina Lapshinova-Koltunski},
url = {https://aclanthology.org/2021.nodalida-main.6},
year = {2021},
date = {2021},
booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)},
pages = {53-63},
publisher = {Link{\"o}ping University Electronic Press, Sweden},
abstract = {The present paper deals with a computational analysis of translationese in professional and student English-to-German translations belonging to different registers. Building upon an information-theoretical approach, we test translation conformity to source and target language in terms of a neural language model’s perplexity over Part of Speech (PoS) sequences. Our primary focus is on register diversification vs. convergence, reflected in the use of constructions eliciting a higher vs. lower perplexity score. Our results show that, against our expectations, professional translations elicit higher perplexity scores from a target language model than students’ translations. An analysis of the distribution of PoS patterns across registers shows that this apparent paradox is the effect of higher stylistic diversification and register sensitivity in professional translations. Our results contribute to the understanding of human translationese and shed light on the variation in texts generated by different translators, which is valuable for translation studies, multilingual language processing, and machine translation.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Menzel, Katrin; Przybyl, Heike; Lapshinova-Koltunski, Ekaterina

EPIC-UdS - ein mehrsprachiges Korpus als Grundlage für die korpusbasierte Dolmetsch- und Übersetzungswissenschaft Miscellaneous

TRANSLATA IV - 4. Internationale Konferenz zur Translationswissenschaft, Innsbruck, 2021.

@miscellaneous{Menzel2021epic,
title = {EPIC-UdS - ein mehrsprachiges Korpus als Grundlage f{\"u}r die korpusbasierte Dolmetsch- und {\"U}bersetzungswissenschaft},
author = {Katrin Menzel and Heike Przybyl and Ekaterina Lapshinova-Koltunski},
year = {2021},
date = {2021},
booktitle = {TRANSLATA IV - 4. Internationale Konferenz zur Translationswissenschaft},
address = {Innsbruck},
pubstate = {published},
type = {miscellaneous}
}

Project:   B7

Lapshinova-Koltunski, Ekaterina; Bizzoni, Yuri; Przybyl, Heike; Teich, Elke

Found in translation/interpreting: combining data-driven and supervised methods to analyse cross-linguistically mediated communication Inproceedings

Proceedings of the Workshop on Modelling Translation: Translatology in the Digital Age (MoTra21), Association for Computational Linguistics, pp. 82-90, online, 2021.

We report on a study of the specific linguistic properties of cross-linguistically mediated communication, comparing written and spoken translation (simultaneous interpreting) in the domain of European Parliament discourse. Specifically, we compare translations and interpreting with target language original texts/speeches in terms of (a) predefined features commonly used for translationese detection, and (b) features derived in a data-driven fashion from translation and interpreting corpora. For the latter, we use n-gram language models combined with relative entropy (Kullback-Leibler Divergence). We set up a number of classification tasks comparing translations with comparable texts originally written in the target language and interpreted speeches with target language comparable speeches to assess the contributions of predefined and data-driven features to the distinction between translation, interpreting and originals. Our analysis reveals that interpreting is more distinct from comparable originals than translation and that its most distinctive features signal an overemphasis of oral, online production more than showing traces of cross-linguistically mediated communication.

@inproceedings{LapshinovaEtAl2021interp,
title = {Found in translation/interpreting: combining data-driven and supervised methods to analyse cross-linguistically mediated communication},
author = {Ekaterina Lapshinova-Koltunski and Yuri Bizzoni and Heike Przybyl and Elke Teich},
url = {https://aclanthology.org/2021.motra-1.9/},
year = {2021},
date = {2021-05-31},
booktitle = {Proceedings of the Workshop on Modelling Translation: Translatology in the Digital Age (MoTra21)},
pages = {82-90},
publisher = {Association for Computational Linguistics},
address = {online},
abstract = {We report on a study of the specific linguistic properties of cross-linguistically mediated communication, comparing written and spoken translation (simultaneous interpreting) in the domain of European Parliament discourse. Specifically, we compare translations and interpreting with target language original texts/speeches in terms of (a) predefined features commonly used for translationese detection, and (b) features derived in a data-driven fashion from translation and interpreting corpora. For the latter, we use n-gram language models combined with relative entropy (Kullback-Leibler Divergence). We set up a number of classification tasks comparing translations with comparable texts originally written in the target language and interpreted speeches with target language comparable speeches to assess the contributions of predefined and data-driven features to the distinction between translation, interpreting and originals. Our analysis reveals that interpreting is more distinct from comparable originals than translation and that its most distinctive features signal an overemphasis of oral, online production more than showing traces of cross-linguistically mediated communication.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Lapshinova-Koltunski, Ekaterina

Analysing the Dimension of Mode in Translation Book Chapter

Bisiada, Mario (Ed.): Empirical Studies in Translation and Discourse. Translation and Multilingual Natural Language Processing, Language Science Press, pp. 223-243, Berlin, 2021, ISBN 978-3-96110-300-3, ISSN 2364-8899.

The present chapter applies text classification to test how well we can distinguish between texts along two dimensions: a text-production dimension that distinguishes between translations and non-translations (where translations also include interpreted texts); and a mode dimension that distinguishes between spoken and written texts. The chapter also aims to investigate the relationship between these two dimensions. Moreover, it investigates whether the same linguistic features that are derived from variational linguistics contribute to the prediction of mode in both translations and non-translations. The distributional information about these features was used to statistically model variation along the two dimensions. The results show that the same feature set can be used to automatically differentiate translations from non-translations, as well as spoken texts from written texts. However, language variation along the dimension of mode is stronger than that along the dimension of text production, as classification into spoken and written texts delivers better results. Besides, linguistic features that contribute to the distinction between spoken and written mode are similar in both translated and non-translated language.

@inbook{Lapshinova2021dimension,
title = {Analysing the Dimension of Mode in Translation},
author = {Ekaterina Lapshinova-Koltunski},
editor = {Mario Bisiada},
url = {https://doi.org/10.5281/zenodo.4450014},
doi = {10.5281/zenodo.4450014},
year = {2021},
date = {2021},
booktitle = {Empirical Studies in Translation and Discourse. Translation and Multilingual Natural Language Processing},
isbn = {978-3-96110-300-3},
issn = {2364-8899},
pages = {223-243},
publisher = {Language Science Press},
address = {Berlin},
abstract = {The present chapter applies text classification to test how well we can distinguish between texts along two dimensions: a text-production dimension that distinguishes between translations and non-translations (where translations also include interpreted texts); and a mode dimension that distinguishes between spoken and written texts. The chapter also aims to investigate the relationship between these two dimensions. Moreover, it investigates whether the same linguistic features that are derived from variational linguistics contribute to the prediction of mode in both translations and non-translations. The distributional information about these features was used to statistically model variation along the two dimensions. The results show that the same feature set can be used to automatically differentiate translations from non-translations, as well as spoken texts from written texts. However, language variation along the dimension of mode is stronger than that along the dimension of text production, as classification into spoken and written texts delivers better results. Besides, linguistic features that contribute to the distinction between spoken and written mode are similar in both translated and non-translated language.},
pubstate = {published},
type = {inbook}
}

Project:   B7

Teich, Elke; Martínez Martínez, José; Karakanta, Alina

Translation, information theory and cognition Book Chapter

Alves, Fabio; Lykke Jakobsen, Arnt (Ed.): The Routledge Handbook of Translation and Cognition, Routledge, pp. 360-375, London, UK, 2020, ISBN 9781138037007.

The chapter sketches a formal basis for the probabilistic modelling of human translation on the basis of information theory. We provide a definition of Shannon information applied to linguistic communication and discuss its relevance for modelling translation. We further explain the concept of the noisy channel and provide the link to modelling human translational choice. We suggest that a number of translation-relevant variables, notably (dis)similarity between languages, level of expertise and translation mode (i.e., interpreting vs. translation), may be appropriately indexed by entropy, which in turn has been shown to indicate production effort.

@inbook{Teich-etal2020-handbook,
title = {Translation, information theory and cognition},
author = {Elke Teich and Jos{\'e} Mart{\'i}nez Mart{\'i}nez and Alina Karakanta},
editor = {Fabio Alves and Arnt Lykke Jakobsen},
url = {https://www.taylorfrancis.com/chapters/edit/10.4324/9781315178127-24/translation-information-theory-cognition-elke-teich-josé-martínez-martínez-alina-karakanta},
year = {2020},
date = {2020},
booktitle = {The Routledge Handbook of Translation and Cognition},
isbn = {9781138037007},
pages = {360-375},
publisher = {Routledge},
address = {London, UK},
abstract = {The chapter sketches a formal basis for the probabilistic modelling of human translation on the basis of information theory. We provide a definition of Shannon information applied to linguistic communication and discuss its relevance for modelling translation. We further explain the concept of the noisy channel and provide the link to modelling human translational choice. We suggest that a number of translation-relevant variables, notably (dis)similarity between languages, level of expertise and translation mode (i.e., interpreting vs. translation), may be appropriately indexed by entropy, which in turn has been shown to indicate production effort.},
pubstate = {published},
type = {inbook}
}

Project:   B7

Bizzoni, Yuri; Juzek, Tom; España-Bonet, Cristina; Dutta Chowdhury, Koel; van Genabith, Josef; Teich, Elke

How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech Inproceedings

The 17th International Workshop on Spoken Language Translation, Seattle, WA, United States, 2020.

Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).

@inproceedings{Bizzoni2020,
title = {How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech},
author = {Yuri Bizzoni and Tom Juzek and Cristina Espa{\~n}a-Bonet and Koel Dutta Chowdhury and Josef van Genabith and Elke Teich},
url = {https://aclanthology.org/2020.iwslt-1.34/},
doi = {10.18653/v1/2020.iwslt-1.34},
year = {2020},
date = {2020},
booktitle = {The 17th International Workshop on Spoken Language Translation},
address = {Seattle, WA, United States},
abstract = {Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs. machine) rather than to the data (written vs. spoken).},
pubstate = {published},
type = {inproceedings}
}

Projects:   B6 B7

Bizzoni, Yuri; Teich, Elke

Analyzing variation in translation through neural semantic spaces Inproceedings

Special topic: Neural Networks for Building and Using Comparable Corpora, Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria, 2019.

We present an approach for exploring the lexical choice patterns in translation on the basis of word embeddings. Specifically, we are interested in variation in translation according to translation mode, i.e. (written) translation vs. (simultaneous) interpreting. While it might seem obvious that the outputs of the two translation modes differ, there are hardly any accounts of the summative linguistic effects of one vs. the other. To explore such effects at the lexical level, we propose a data-driven approach: using neural word embeddings (Word2Vec), we compare the bilingual semantic spaces emanating from source-to-translation and source-to-interpreting.

@inproceedings{Bizzoni2019,
title = {Analyzing variation in translation through neural semantic spaces},
author = {Yuri Bizzoni and Elke Teich},
url = {https://comparable.limsi.fr/bucc2019/Bizzoni_BUCC2019_paper1.pdf},
year = {2019},
date = {2019-08-30},
booktitle = {Special topic: Neural Networks for Building and Using Comparable Corpora, Recent Advances in Natural Language Processing (RANLP), Varna, Bulgaria},
address = {Varna, Bulgaria},
abstract = {We present an approach for exploring the lexical choice patterns in translation on the basis of word embeddings. Specifically, we are interested in variation in translation according to translation mode, i.e. (written) translation vs. (simultaneous) interpreting. While it might seem obvious that the outputs of the two translation modes differ, there are hardly any accounts of the summative linguistic effects of one vs. the other. To explore such effects at the lexical level, we propose a data-driven approach: using neural word embeddings (Word2Vec), we compare the bilingual semantic spaces emanating from source-to-translation and source-to-interpreting.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Karakanta, Alina; Menzel, Katrin; Przybyl, Heike; Teich, Elke

Detecting linguistic variation in translated vs. interpreted texts using relative entropy Inproceedings

Empirical Investigations in the Forms of Mediated Discourse at the European Parliament, Thematic Session at the 49th Poznan Linguistic Meeting (PLM2019), Poznan, 2019.

Our aim is to identify the features distinguishing simultaneously interpreted texts from translations (apart from being more oral) and the characteristics they have in common which set them apart from originals (translationese features). Empirical research on the features of interpreted language and cross-modal analyses in contrast to research on translated language alone has attracted wider interest only recently. Previous interpreting studies are typically based on relatively small datasets of naturally occurring or experimental data (e.g. Shlesinger/Ordan 2012, Chmiel et al. forthcoming, Dragsted/Hansen 2009) for specific language pairs. We propose a corpus-based, exploratory approach to detect typical linguistic features of interpreting vs. translation based on a well-structured multilingual European Parliament translation and interpreting corpus. We use the Europarl-UdS corpus (Karakanta et al. 2018) containing originals and translations for English, German and Spanish, and selected material from existing interpreting/combined interpreting-translation corpora (EPIC: Sandrelli/Bendazzoli 2005; TIC: Kajzer-Wietrzny 2012; EPICG: Defrancq 2015), complemented with additional interpreting data (German). The data were transcribed or revised according to our transcription guidelines ensuring comparability across different datasets. All data were enriched with relevant metadata. We aim to contribute to a more nuanced understanding of the characteristics of translated and interpreted texts and a more adequate empirical theory of mediated discourse.

@inproceedings{Karakanta2019,
title = {Detecting linguistic variation in translated vs. interpreted texts using relative entropy},
author = {Alina Karakanta and Katrin Menzel and Heike Przybyl and Elke Teich},
url = {https://www.researchgate.net/publication/336990114_Detecting_linguistic_variation_in_translated_vs_interpreted_texts_using_relative_entropy},
year = {2019},
date = {2019},
booktitle = {Empirical Investigations in the Forms of Mediated Discourse at the European Parliament, Thematic Session at the 49th Poznan Linguistic Meeting (PLM2019), Poznan},
abstract = {Our aim is to identify the features distinguishing simultaneously interpreted texts from translations (apart from being more oral) and the characteristics they have in common which set them apart from originals (translationese features). Empirical research on the features of interpreted language and cross-modal analyses in contrast to research on translated language alone has attracted wider interest only recently. Previous interpreting studies are typically based on relatively small datasets of naturally occurring or experimental data (e.g. Shlesinger/Ordan 2012, Chmiel et al. forthcoming, Dragsted/Hansen 2009) for specific language pairs. We propose a corpus-based, exploratory approach to detect typical linguistic features of interpreting vs. translation based on a well-structured multilingual European Parliament translation and interpreting corpus. We use the Europarl-UdS corpus (Karakanta et al. 2018) containing originals and translations for English, German and Spanish, and selected material from existing interpreting/combined interpreting-translation corpora (EPIC: Sandrelli/Bendazzoli 2005; TIC: Kajzer-Wietrzny 2012; EPICG: Defrancq 2015), complemented with additional interpreting data (German). The data were transcribed or revised according to our transcription guidelines ensuring comparability across different datasets. All data were enriched with relevant metadata. We aim to contribute to a more nuanced understanding of the characteristics of translated and interpreted texts and a more adequate empirical theory of mediated discourse.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7

Karakanta, Alina; Przybyl, Heike; Teich, Elke

Exploring Variation in Translation with Relative Entropy Inproceedings

Lavid-López, Julia; Maíz-Arévalo, Carmen; Zamorano-Mansilla, Juan Rafael (Ed.): Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations, John Benjamins Publishing Company, pp. 307–323, 2018.

While some authors have suggested that translationese fingerprints are universal, others have shown that there is a fair amount of variation among translations due to source language shining through, translation type or translation mode. In our work, we attempt to gain empirical insights into variation in translation, focusing here on translation mode (translation vs. interpreting). Our goal is to discover features of translationese and interpretese that distinguish translated and interpreted output from comparable original text/speech as well as from each other at different linguistic levels. We use relative entropy (Kullback-Leibler Divergence) and visualization with word clouds. Our analysis shows differences in typical words between originals vs. non-originals as well as between translation modes both at lexical and grammatical levels.

@inproceedings{Karakanta2018b,
title = {Exploring Variation in Translation with Relative Entropy},
author = {Alina Karakanta and Heike Przybyl and Elke Teich},
editor = {Julia Lavid-L{\'o}pez and Carmen Ma{\'i}z-Ar{\'e}valo and Juan Rafael Zamorano-Mansilla},
url = {https://benjamins.com/catalog/btl.158.12kar},
doi = {10.1075/btl.158.12kar},
year = {2018},
date = {2018},
booktitle = {Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations},
pages = {307--323},
publisher = {John Benjamins Publishing Company},
abstract = {While some authors have suggested that translationese fingerprints are universal, others have shown that there is a fair amount of variation among translations due to source language shining through, translation type or translation mode. In our work, we attempt to gain empirical insights into variation in translation, focusing here on translation mode (translation vs. interpreting). Our goal is to discover features of translationese and interpretese that distinguish translated and interpreted output from comparable original text/speech as well as from each other at different linguistic levels. We use relative entropy (Kullback-Leibler Divergence) and visualization with word clouds. Our analysis shows differences in typical words between originals vs. non-originals as well as between translation modes both at lexical and grammatical levels.},
pubstate = {published},
type = {inproceedings}
}

Project:   B7
