Measuring multilingualism and language mixing in corpus texts - Speaker: Roland Meyer

While multilingual data has long been digitally encoded (Barnett et al. 2000 and references therein), the proper methodology for capturing and comparing code-switching and mixed language use in corpus texts is still hotly debated (e.g., Das & Gambäck 2014; Guzmán et al. 2017). A major obstacle is the degree of observable variation at all linguistic levels, from phonology to discourse, which calls for extremely detailed, often hand-crafted annotation. While this may still be feasible for small code-switching corpora, it is out of the question for cases like Hinglish (Hindi-English mixing), where only automatically calculated indices of multilingualism can help (Srivastava & Singh 2021). In a project on register and language mixing (SFB 1412, Meyer & Szucsich), we specifically address mixing in closely related Slavic languages: the so-called "Surzhyk" (Ukrainian-Russian) and "Po naszymu" (Polish-Czech). To evaluate situationally conditioned linguistic behaviour, we must be able to map and quantify language mixing in texts. To this end, we evaluate measures and models developed for Slavic intercomprehension (Jágrová et al. 2018; Avgustinova 2020; Stenger et al. 2020, 2022; Kunilovskaya et al. 2025). In the talk, I present our current work towards profiles of language mixing and discuss its application to mixed Slavic texts.
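
To make "automatically calculated indices of multilingualism" concrete, here is a minimal Python sketch of two indices commonly associated with the cited literature: the M-index (Barnett et al. 2000) and the Code-Mixing Index (Das & Gambäck 2014). It assumes each token already carries a language tag. The function names, the tag labels ("uk", "ru", "other") and the toy input are illustrative assumptions, not the project's actual tooling, and the abstract does not state which indices the talk itself uses.

```python
from collections import Counter


def m_index(tags):
    """M-index: 0.0 for monolingual text, 1.0 for perfectly balanced use.

    `tags` holds one language tag per token (language-neutral tokens
    are assumed to have been filtered out beforehand).
    """
    counts = Counter(tags)
    k = len(counts)  # number of languages observed
    if k < 2:
        return 0.0
    total = sum(counts.values())
    # Sum of squared language proportions.
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def cmi(tags, neutral="other"):
    """Code-Mixing Index for one utterance, on a 0-100 scale.

    Tokens tagged `neutral` (a hypothetical label for punctuation,
    numbers, names, etc.) count as language-independent.
    """
    n = len(tags)
    counts = Counter(t for t in tags if t != neutral)
    u = n - sum(counts.values())  # language-independent tokens
    if n == u:
        return 0.0
    # Share of tokens outside the dominant language.
    return 100 * (1 - max(counts.values()) / (n - u))


# Toy example with hypothetical token-level tags.
example = ["uk", "uk", "ru", "ru"]
print(m_index(example), cmi(example))
```

Both indices only summarise how much of each language is present; they ignore where switches occur, which is one reason richer profiles of mixing may be needed for closely related languages.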
