Publications - SFB 1102

Dyer, Andrew

What does Surprisal have to do with Information Status? Inproceedings

Vylomova, Ekaterina; Shcherbakov, Andrei; Rani, Priya (Ed.): Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 26-31, Rabat, Morocco, 2026, ISBN 979-8-89176-374-6.

It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to then apply this to information structure and status, considering surprising mentions to be new and unsurprising ones to be given, providing us with a ready-made continuous metric of information givenness/newness. To see if this conflation is appropriate, we perform regression experiments to see if language model surprisal is actually well predicted by information status as manually annotated, and if so, if this effect is separable from more trivial linguistic information such as parts of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal can be much better predicted by the effect of parts of speech, which are highly correlated with both information status and surprisal; and word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.

@inproceedings{dyer-2026-surprisal,
title = {What does Surprisal have to do with Information Status?},
author = {Andrew Dyer},
editor = {Ekaterina Vylomova and Andrei Shcherbakov and Priya Rani},
url = {https://aclanthology.org/2026.sigtyp-main.4/},
doi = {10.18653/v1/2026.sigtyp-main.4},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual {NLP}},
isbn = {979-8-89176-374-6},
pages = {26--31},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to then apply this to information structure and status, considering surprising mentions to be new and unsurprising ones to be given, providing us with a ready-made continuous metric of information givenness/newness. To see if this conflation is appropriate, we perform regression experiments to see if language model surprisal is actually well predicted by information status as manually annotated, and if so, if this effect is separable from more trivial linguistic information such as parts of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal can be much better predicted by the effect of parts of speech, which are highly correlated with both information status and surprisal; and word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Steuer, Julius; Nakai, Toshiki; Dyer, Andrew; Talamo, Luigi; Verkerk, Annemarie

Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus Inproceedings

Vylomova, Ekaterina; Shcherbakov, Andrei; Rani, Priya (Ed.): Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 18-25, Rabat, Morocco, 2026, ISBN 979-8-89176-374-6.

The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk and Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.

@inproceedings{steuer-etal-2026-evaluating,
title = {Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus},
author = {Julius Steuer and Toshiki Nakai and Andrew Dyer and Luigi Talamo and Annemarie Verkerk},
editor = {Ekaterina Vylomova and Andrei Shcherbakov and Priya Rani},
url = {https://aclanthology.org/2026.sigtyp-main.3/},
doi = {10.18653/v1/2026.sigtyp-main.3},
year = {2026},
date = {2026},
booktitle = {Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
isbn = {979-8-89176-374-6},
pages = {18--25},
publisher = {Association for Computational Linguistics},
address = {Rabat, Morocco},
abstract = {The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk and Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Dyer, Andrew; O’Brien, Colleen Alena

Towards better annotation practices for symmetrical voice in Universal Dependencies Inproceedings

Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025), Association for Computational Linguistics, pp. 137-142, Ljubljana, Slovenia, 2025, ISBN 979-8-89176-292-3.

Austronesian languages exhibit features that are challenging for Universal Dependencies: most notably, the symmetric voice system, whereby agent, patient, and instrumental arguments (among others) can be the pivot of a transitive structure – complicating the usual assumption that subjects of transitive sentences are semantic agents, and objects semantic patients. To showcase our ideas of how to address the representation of such systems in Universal Dependencies, we introduce a small treebank of sentences from texts and elicitation sessions in Gorontalo, an Austronesian language of Sulawesi (Indonesia), which exhibits a Philippine-type voice system. We discuss the annotation guidelines for this language, and the extensions of the Universal Dependencies guidelines that are needed to accommodate this and other Austronesian languages.

https://aclanthology.org/2025.udw-1.15/

@inproceedings{dyer-obrien-2025-towards,
title = {Towards better annotation practices for symmetrical voice in Universal Dependencies},
author = {Andrew Dyer and Colleen Alena O’Brien},
url = {https://aclanthology.org/2025.udw-1.15/},
year = {2025},
date = {2025},
booktitle = {Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)},
isbn = {979-8-89176-292-3},
pages = {137--142},
publisher = {Association for Computational Linguistics},
address = {Ljubljana, Slovenia},
abstract = {Austronesian languages exhibit features that are challenging for Universal Dependencies: most notably, the symmetric voice system, whereby agent, patient, and instrumental arguments (among others) can be the pivot of a transitive structure – complicating the usual assumption that subjects of transitive sentences are semantic agents, and objects semantic patients. To showcase our ideas of how to address the representation of such systems in Universal Dependencies, we introduce a small treebank of sentences from texts and elicitation sessions in Gorontalo, an Austronesian language of Sulawesi (Indonesia), which exhibits a Philippine-type voice system. We discuss the annotation guidelines for this language, and the extensions of the Universal Dependencies guidelines that are needed to accommodate this and other Austronesian languages.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Verkerk, Annemarie; Shcherbakova, Olena; Haynie, Hannah J.; Skirgård, Hedvig; Rzymski, Christoph; Atkinson, Quentin D.; Greenhill, Simon J.; Gray, Russell D.

Enduring constraints on grammar revealed by Bayesian spatiophylogenetic analyses Journal Article

Nature Human Behaviour, 2025, ISSN 2397-3374.

Human languages show astonishing variety, yet their diversity is constrained by recurring patterns. Linguists have long argued over the extent and causes of these grammatical ‘universals’. Using Grambank—a comprehensive database of grammatical features across the world’s languages—we tested 191 proposed universals with Bayesian analyses that account for both genealogical descent and geographical proximity. We find statistical support for about a third of the proposed linguistic universals. The majority of these concern word order and hierarchical universals: two types that have featured prominently in earlier work. Evolutionary analyses show that languages tend to change in ways that converge on these preferred patterns. This suggests that, despite the vast design space of possible grammars, languages do not evolve entirely at random. Shared cognitive and communicative pressures repeatedly push languages towards similar solutions.

@article{Verkerk-etal-2025-Bayesian,
title = {Enduring constraints on grammar revealed by Bayesian spatiophylogenetic analyses},
author = {Annemarie Verkerk and Olena Shcherbakova and Hannah J. Haynie and Hedvig Skirgård and Christoph Rzymski and Quentin D. Atkinson and Simon J. Greenhill and Russell D. Gray},
url = {https://doi.org/10.1038/s41562-025-02325-z},
doi = {10.1038/s41562-025-02325-z},
year = {2025},
date = {2025},
journal = {Nature Human Behaviour},
abstract = {Human languages show astonishing variety, yet their diversity is constrained by recurring patterns. Linguists have long argued over the extent and causes of these grammatical ‘universals’. Using Grambank—a comprehensive database of grammatical features across the world’s languages—we tested 191 proposed universals with Bayesian analyses that account for both genealogical descent and geographical proximity. We find statistical support for about a third of the proposed linguistic universals. The majority of these concern word order and hierarchical universals: two types that have featured prominently in earlier work. Evolutionary analyses show that languages tend to change in ways that converge on these preferred patterns. This suggests that, despite the vast design space of possible grammars, languages do not evolve entirely at random. Shared cognitive and communicative pressures repeatedly push languages towards similar solutions.},
pubstate = {published},
type = {article}
}

Project: C7

Talamo, Luigi

Introducing STAF: The Saarbrücken Treebank of Albanian Fiction Journal Article

Journal of Open Humanities Data, 11, pp. 1–6, 2025.

The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.

@article{Talamo-2025,
title = {Introducing STAF: The Saarbr{\"u}cken Treebank of Albanian Fiction},
author = {Luigi Talamo},
url = {https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.285},
doi = {10.5334/johd.285},
year = {2025},
date = {2025},
journal = {Journal of Open Humanities Data},
pages = {1--6},
volume = {11},
number = {3},
abstract = {The present paper describes the building of STAF, a Universal Dependencies treebank for Albanian. STAF was bootstrapped using a Stanza model trained on previously unreleased data and then manually corrected by three Albanian speakers supervised by the author, who also revised all sentences. STAF focuses on the fiction genre, featuring 200 sentences selected from nine literary texts written by Albanian contemporary authors.},
pubstate = {published},
type = {article}
}

Project: C7

Dyer, Andrew; Bahceci, Ruveyda Betul; Rajestari, Maryam; Rouvalis, Andreas; Singhal, Aarushi; Stodolinska, Yuliya; Umniyati, Syahidah Asma; Rodrigues Menezes de Oliveira Vaz, Helena

A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain Inproceedings

Dakota, Daniel; Jablotschkin, Sarah; Kübler, Sandra; Zinsmeister, Heike (Ed.): Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024), Association for Computational Linguistics, pp. 55-64, Hamburg, Germany, 2024.

Information status, the newness or givenness of referents in discourse, is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used both for corpus analysis and for model training.

https://aclanthology.org/2024.tlt-1.7/

@inproceedings{dyer-etal-2024-multilingual,
title = {A Multilingual Parallel Corpus for Coreference Resolution and Information Status in the Literary Domain},
author = {Andrew Dyer and Ruveyda Betul Bahceci and Maryam Rajestari and Andreas Rouvalis and Aarushi Singhal and Yuliya Stodolinska and Syahidah Asma Umniyati and Helena Rodrigues Menezes de Oliveira Vaz},
editor = {Daniel Dakota and Sarah Jablotschkin and Sandra K{\"u}bler and Heike Zinsmeister},
url = {https://aclanthology.org/2024.tlt-1.7/},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)},
pages = {55--64},
publisher = {Association for Computational Linguistics},
address = {Hamburg, Germany},
abstract = {Information status, the newness or givenness of referents in discourse, is known to affect the production of language at many different levels. At the morphosyntactic level, information status gives rise to special word orders, elisions, and other phenomena that challenge the notion that morphosyntax can be considered independent of discourse context. Though there are many language-specific corpora annotated for information status and its related phenomena, coreference and anaphora resolution, what is not available at present is a cross-lingually consistently annotated corpus or annotation scheme that would allow for comparative study of these phenomena across many diverse languages. In this paper we present our work to build such a resource. We are annotating a parsed, parallel corpus of prose in many languages for information status and coreference resolution, so that like-for-like cross-lingual comparisons can be made at the intersection of discourse and syntax. Our corpus can and will be used both for corpus analysis and for model training.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Talamo, Luigi; Verkerk, Annemarie; Salaberri, Iker

A quantitative approach to clause type and syntactic change in two Indo-European corpora Journal Article

Italian Journal of Linguistics, 36, pp. 53-82, 2024.

The aim of this paper is to empirically test the claim that subordinate clauses tend to preserve conservative features in language change. To this end, the diachronic behavior of two well-understood and frequently adduced features of grammar, namely null subject pronouns and order of subject, object and verb, is analyzed for main and adverbial clauses in a balanced corpus of 45 Indo-European languages. This study combines qualitative and quantitative analysis by drawing on individual descriptive grammars and parallel corpora respectively. Additionally, diachronic change is modeled using phylogenetic comparative methods. The data suggest that adverbial clauses can in some cases develop asymmetries with respect to their independent counterparts, either through innovation or through preservation of conservative features, possibly due to a communicative need to distinguish clause types by means of grammar. However, the general tendency is for adverbial clauses to change much in the same way as main clauses. This finding contradicts previous claims and calls for a reassessment of studies on the diachronic nature of distinct clause types.

@article{Talamo-etal-2024,
title = {A quantitative approach to clause type and syntactic change in two Indo-European corpora},
author = {Luigi Talamo and Annemarie Verkerk and Iker Salaberri},
url = {https://www.italian-journal-linguistics.com/current-issue/},
doi = {10.26346/1120-2726-225},
year = {2024},
date = {2024},
journal = {Italian Journal of Linguistics},
pages = {53--82},
volume = {36},
number = {2},
abstract = {The aim of this paper is to empirically test the claim that subordinate clauses tend to preserve conservative features in language change. To this end, the diachronic behavior of two well-understood and frequently adduced features of grammar, namely null subject pronouns and order of subject, object and verb, is analyzed for main and adverbial clauses in a balanced corpus of 45 Indo-European languages. This study combines qualitative and quantitative analysis by drawing on individual descriptive grammars and parallel corpora respectively. Additionally, diachronic change is modeled using phylogenetic comparative methods. The data suggest that adverbial clauses can in some cases develop asymmetries with respect to their independent counterparts, either through innovation or through preservation of conservative features, possibly due to a communicative need to distinguish clause types by means of grammar. However, the general tendency is for adverbial clauses to change much in the same way as main clauses. This finding contradicts previous claims and calls for a reassessment of studies on the diachronic nature of distinct clause types.},
pubstate = {published},
type = {article}
}

Project: C7

Verkerk, Annemarie; Talamo, Luigi

mini-CIEP+ : A Shareable Parallel Corpus of Prose Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 135-143, Torino, Italia, 2024.

In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.

https://aclanthology.org/2024.bucc-1.15

@inproceedings{verkerk-talamo-2024-mini,
title = {mini-CIEP+ : A Shareable Parallel Corpus of Prose},
author = {Annemarie Verkerk and Luigi Talamo},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.15},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {135--143},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Talamo, Luigi

Using a parallel corpus to study patterns of word order variation: Determiners and quantifiers within the noun phrase in European languages Journal Article

Linguistic Typology at the Crossroads, 3, pp. 100–131, Bologna, Italy, 2023.

Despite the wealth of studies on word order, there have been very few studies on the order of minor word categories such as determiners and quantifiers. This is likely due to the difficulty of formulating valid cross-linguistic definitions for these categories, which also appear problematic from a computational perspective. A solution lies in the formulation of comparative concepts and in their computational implementation by combining different layers of annotation with manually compiled lists of lexemes; the proposed methodology is exemplified by a study on the position of these categories with respect to the nominal head, which is conducted on a parallel corpus of 17 European languages and uses Shannon’s entropy to quantify word order variation. Whereas the entropy for the article-noun pattern is, as expected, extremely low, the proposed methodology sheds light on the variation of the demonstrative-noun and the quantifier-noun patterns in three languages of the sample.

@article{talamo_2023,
title = {Using a parallel corpus to study patterns of word order variation: Determiners and quantifiers within the noun phrase in European languages},
author = {Luigi Talamo},
url = {https://typologyatcrossroads.unibo.it/article/view/15653},
doi = {10.6092/issn.2785-0943/15653},
year = {2023},
date = {2023},
journal = {Linguistic Typology at the Crossroads},
pages = {100--131},
address = {Bologna, Italy},
volume = {3},
number = {2},
abstract = {Despite the wealth of studies on word order, there have been very few studies on the order of minor word categories such as determiners and quantifiers. This is likely due to the difficulty of formulating valid cross-linguistic definitions for these categories, which also appear problematic from a computational perspective. A solution lies in the formulation of comparative concepts and in their computational implementation by combining different layers of annotation with manually compiled lists of lexemes; the proposed methodology is exemplified by a study on the position of these categories with respect to the nominal head, which is conducted on a parallel corpus of 17 European languages and uses Shannon’s entropy to quantify word order variation. Whereas the entropy for the article-noun pattern is, as expected, extremely low, the proposed methodology sheds light on the variation of the demonstrative-noun and the quantifier-noun patterns in three languages of the sample.},
pubstate = {published},
type = {article}
}

Project: C7

Dyer, Andrew

Revisiting dependency length and intervener complexity minimisation on a parallel corpus in 35 languages Inproceedings

Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 110-119, Dubrovnik, Croatia, 2023.

In this replication study of previous research into dependency length minimisation (DLM), we pilot a new parallel multilingual parsed corpus to examine whether previous findings are upheld when controlling for variation in domain and sentence content between languages. We follow the approach of previous research in comparing the dependency lengths of observed sentences in a multilingual corpus to a variety of baselines: permutations of the sentences, either random or according to some fixed schema. We go on to compare DLM with intervener complexity measure (ICM), an alternative measure of syntactic complexity. Our findings uphold both dependency length and intervener complexity minimisation in all languages under investigation. We also find a markedly lesser extent of dependency length minimisation in verb-final languages, and the same for intervener complexity measure. We conclude that dependency length and intervener complexity minimisation as universals are upheld when controlling for domain and content variation, but that further research is needed into the asymmetry between verb-final and other languages in this regard.

@inproceedings{dyer-2023-revisiting,
title = {Revisiting dependency length and intervener complexity minimisation on a parallel corpus in 35 languages},
author = {Andrew Dyer},
url = {https://aclanthology.org/2023.sigtyp-1.11/},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
pages = {110--119},
publisher = {Association for Computational Linguistics},
address = {Dubrovnik, Croatia},
abstract = {In this replication study of previous research into dependency length minimisation (DLM), we pilot a new parallel multilingual parsed corpus to examine whether previous findings are upheld when controlling for variation in domain and sentence content between languages. We follow the approach of previous research in comparing the dependency lengths of observed sentences in a multilingual corpus to a variety of baselines: permutations of the sentences, either random or according to some fixed schema. We go on to compare DLM with intervener complexity measure (ICM), an alternative measure of syntactic complexity. Our findings uphold both dependency length and intervener complexity minimisation in all languages under investigation. We also find a markedly lesser extent of dependency length minimisation in verb-final languages, and the same for intervener complexity measure. We conclude that dependency length and intervener complexity minimisation as universals are upheld when controlling for domain and content variation, but that further research is needed into the asymmetry between verb-final and other languages in this regard.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Talamo, Luigi

Tweaking UD Annotations to Investigate the Placement of Determiners, Quantifiers and Numerals in the Noun Phrase Inproceedings

Vylomova, Ekaterina; Ponti, Edoardo; Cotterell, Ryan (Ed.): Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 36-41, Seattle, Washington, 2022.

We describe a methodology to extract with finer accuracy word order patterns from texts automatically annotated with Universal Dependency (UD) trained parsers. We use the methodology to quantify the word order entropy of determiners, quantifiers and numerals in ten Indo-European languages, using UD-parsed texts from a parallel corpus of prosaic texts. Our results suggest that the combinations of different UD annotation layers, such as UD Relations, Universal Parts of Speech and lemma, and the introduction of language-specific lists of closed-category lemmata has the two-fold effect of improving the quality of analysis and unveiling hidden areas of variability in word order patterns.

@inproceedings{Talamo_2022,
title = {Tweaking UD Annotations to Investigate the Placement of Determiners, Quantifiers and Numerals in the Noun Phrase},
author = {Luigi Talamo},
editor = {Ekaterina Vylomova and Edoardo Ponti and Ryan Cotterell},
url = {https://aclanthology.org/2022.sigtyp-1.5/},
doi = {10.18653/v1/2022.sigtyp-1.5},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
pages = {36--41},
publisher = {Association for Computational Linguistics},
address = {Seattle, Washington},
abstract = {We describe a methodology to extract with finer accuracy word order patterns from texts automatically annotated with Universal Dependency (UD) trained parsers. We use the methodology to quantify the word order entropy of determiners, quantifiers and numerals in ten Indo-European languages, using UD-parsed texts from a parallel corpus of prosaic texts. Our results suggest that the combinations of different UD annotation layers, such as UD Relations, Universal Parts of Speech and lemma, and the introduction of language-specific lists of closed-category lemmata has the two-fold effect of improving the quality of analysis and unveiling hidden areas of variability in word order patterns.},
pubstate = {published},
type = {inproceedings}
}

Project: C7

Talamo, Luigi; Verkerk, Annemarie

A new methodology for an old problem: A corpus-based typology of adnominal word order in European languages Journal Article

Italian Journal of Linguistics, 34, pp. 171-226, 2022.

Linguistic typology is generally characterized by strong data reduction, stemming from the use of binary or categorical classifications. An example are the categories commonly used in describing word order: adjective-noun vs noun-adjective; genitive-noun vs noun-genitive; etc. Token-based typology is part of an answer towards more fine-grained and appropriate measurement in typology. We discuss an implementation of this methodology and provide a case-study involving adnominal word order in a sample of eleven European languages, using a parallel corpus automatically parsed with models from the Universal Dependencies project. By quantifying adnominal word order variability in terms of Shannon’s entropy, we find that the placement of certain nominal modifiers in relation to their head noun is more variable than reported by typological databases, both within and across language genera. Whereas the low variability of placement of articles, adpositions and relative clauses is generally confirmed by our findings, the adnominal ordering of demonstratives and adjectives is more variable than previously reported.

@article{article,
title = {A new methodology for an old problem: A corpus-based typology of adnominal word order in European languages},
author = {Luigi Talamo and Annemarie Verkerk},
url = {https://www.italian-journal-linguistics.com/app/uploads/2023/01/8-Talamo.pdf},
doi = {10.26346/1120-2726-197},
year = {2022},
date = {2022},
journal = {Italian Journal of Linguistics},
pages = {171--226},
volume = {34},
abstract = {Linguistic typology is generally characterized by strong data reduction, stemming from the use of binary or categorical classifications. An example are the categories commonly used in describing word order: adjective-noun vs noun-adjective; genitive-noun vs noun-genitive; etc. Token-based typology is part of an answer towards more fine-grained and appropriate measurement in typology. We discuss an implementation of this methodology and provide a case-study involving adnominal word order in a sample of eleven European languages, using a parallel corpus automatically parsed with models from the Universal Dependencies project. By quantifying adnominal word order variability in terms of Shannon's entropy, we find that the placement of certain nominal modifiers in relation to their head noun is more variable than reported by typological databases, both within and across language genera. Whereas the low variability of placement of articles, adpositions and relative clauses is generally confirmed by our findings, the adnominal ordering of demonstratives and adjectives is more variable than previously reported.},
pubstate = {published},
type = {article}
}

Project: C7