Publications

Verkerk, Annemarie; Talamo, Luigi

mini-CIEP+ : A Shareable Parallel Corpus of Prose Inproceedings

Zweigenbaum, Pierre; Rapp, Reinhard; Sharoff, Serge (Ed.): Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, pp. 135-143, Torino, Italia, 2024.

In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.

@inproceedings{verkerk-talamo-2024-mini,
title = {mini-CIEP+ : A Shareable Parallel Corpus of Prose},
author = {Annemarie Verkerk and Luigi Talamo},
editor = {Pierre Zweigenbaum and Reinhard Rapp and Serge Sharoff},
url = {https://aclanthology.org/2024.bucc-1.15},
year = {2024},
date = {2024},
booktitle = {Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024},
pages = {135--143},
publisher = {ELRA and ICCL},
address = {Torino, Italia},
abstract = {In this paper we present mini-CIEP+, a shareable parallel corpus of prose. mini-CIEP+ consists of the first part of ten different works of prose across many different languages, allowing for the cross-linguistic investigation of larger discourse units. Subcorpora typically contain 5750 sentences and almost 125K tokens. Subcorpora have dependency grammar annotation based on the Universal Dependencies standard (de Marneffe et al., 2021). mini-CIEP+ version 1.0 is available in 35 languages, with the aim of increasing the sample to 50 languages. It is shareable due to recent developments in German law, which allow researchers to share up to 15% of copyrighted material with a select group of people for their own research. Hence, mini-CIEP+ is not publicly available, but is rather shareable in a modular fashion with select researchers. We additionally describe future plans for further annotation of mini-CIEP+ as well as its limitations.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7

Talamo, Luigi

Using a parallel corpus to study patterns of word order variation: Determiners and quantifiers within the noun phrase in European languages Journal Article

Linguistic Typology at the Crossroads, 3, pp. 100–131, Bologna, Italy, 2023.

Despite the wealth of studies on word order, there have been very few studies on the order of minor word categories such as determiners and quantifiers. This is likely due to the difficulty of formulating valid cross-linguistic definitions for these categories, which also appear problematic from a computational perspective. A solution lies in the formulation of comparative concepts and in their computational implementation by combining different layers of annotation with manually compiled lists of lexemes; the proposed methodology is exemplified by a study on the position of these categories with respect to the nominal head, which is conducted on a parallel corpus of 17 European languages and uses Shannon’s entropy to quantify word order variation. Whereas the entropy for the article-noun pattern is, as expected, extremely low, the proposed methodology sheds light on the variation of the demonstrative-noun and the quantifier-noun patterns in three languages of the sample.

@article{talamo_2023,
title = {Using a parallel corpus to study patterns of word order variation: Determiners and quantifiers within the noun phrase in European languages},
author = {Luigi Talamo},
url = {https://typologyatcrossroads.unibo.it/article/view/15653},
doi = {10.6092/issn.2785-0943/15653},
year = {2023},
date = {2023},
journal = {Linguistic Typology at the Crossroads},
pages = {100--131},
address = {Bologna, Italy},
volume = {3},
number = {2},
abstract = {Despite the wealth of studies on word order, there have been very few studies on the order of minor word categories such as determiners and quantifiers. This is likely due to the difficulty of formulating valid cross-linguistic definitions for these categories, which also appear problematic from a computational perspective. A solution lies in the formulation of comparative concepts and in their computational implementation by combining different layers of annotation with manually compiled lists of lexemes; the proposed methodology is exemplified by a study on the position of these categories with respect to the nominal head, which is conducted on a parallel corpus of 17 European languages and uses Shannon’s entropy to quantify word order variation. Whereas the entropy for the article-noun pattern is, as expected, extremely low, the proposed methodology sheds light on the variation of the demonstrative-noun and the quantifier-noun patterns in three languages of the sample.},
pubstate = {published},
type = {article}
}

Project:   C7

Dyer, Andrew

Revisiting dependency length and intervener complexity minimisation on a parallel corpus in 35 languages Inproceedings

Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 110-119, Dubrovnik, Croatia, 2023.

In this replication study of previous research into dependency length minimisation (DLM), we pilot a new parallel multilingual parsed corpus to examine whether previous findings are upheld when controlling for variation in domain and sentence content between languages. We follow the approach of previous research in comparing the dependency lengths of observed sentences in a multilingual corpus to a variety of baselines: permutations of the sentences, either random or according to some fixed schema. We go on to compare DLM with intervener complexity measure (ICM), an alternative measure of syntactic complexity. Our findings uphold both dependency length and intervener complexity minimisation in all languages under investigation. We also find a markedly lesser extent of dependency length minimisation in verb-final languages, and the same for intervener complexity measure. We conclude that dependency length and intervener complexity minimisation as universals are upheld when controlling for domain and content variation, but that further research is needed into the asymmetry between verb-final and other languages in this regard.

@inproceedings{dyer-2023-revisiting,
title = {Revisiting dependency length and intervener complexity minimisation on a parallel corpus in 35 languages},
author = {Andrew Dyer},
url = {https://aclanthology.org/2023.sigtyp-1.11/},
year = {2023},
date = {2023},
booktitle = {Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
pages = {110--119},
publisher = {Association for Computational Linguistics},
address = {Dubrovnik, Croatia},
abstract = {In this replication study of previous research into dependency length minimisation (DLM), we pilot a new parallel multilingual parsed corpus to examine whether previous findings are upheld when controlling for variation in domain and sentence content between languages. We follow the approach of previous research in comparing the dependency lengths of observed sentences in a multilingual corpus to a variety of baselines: permutations of the sentences, either random or according to some fixed schema. We go on to compare DLM with intervener complexity measure (ICM), an alternative measure of syntactic complexity. Our findings uphold both dependency length and intervener complexity minimisation in all languages under investigation. We also find a markedly lesser extent of dependency length minimisation in verb-final languages, and the same for intervener complexity measure. We conclude that dependency length and intervener complexity minimisation as universals are upheld when controlling for domain and content variation, but that further research is needed into the asymmetry between verb-final and other languages in this regard.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7

Talamo, Luigi

Tweaking UD Annotations to Investigate the Placement of Determiners, Quantifiers and Numerals in the Noun Phrase Inproceedings

Vylomova, Ekaterina; Ponti, Edoardo; Cotterell, Ryan (Ed.): Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Association for Computational Linguistics, pp. 36-41, Seattle, Washington, 2022.

We describe a methodology to extract with finer accuracy word order patterns from texts automatically annotated with Universal Dependency (UD) trained parsers. We use the methodology to quantify the word order entropy of determiners, quantifiers and numerals in ten Indo-European languages, using UD-parsed texts from a parallel corpus of prosaic texts. Our results suggest that the combination of different UD annotation layers, such as UD Relations, Universal Parts of Speech and lemma, and the introduction of language-specific lists of closed-category lemmata have the two-fold effect of improving the quality of analysis and unveiling hidden areas of variability in word order patterns.

@inproceedings{Talamo_2022,
title = {Tweaking UD Annotations to Investigate the Placement of Determiners, Quantifiers and Numerals in the Noun Phrase},
author = {Luigi Talamo},
editor = {Ekaterina Vylomova and Edoardo Ponti and Ryan Cotterell},
url = {https://aclanthology.org/2022.sigtyp-1.5/},
doi = {10.18653/v1/2022.sigtyp-1.5},
year = {2022},
date = {2022},
booktitle = {Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP},
pages = {36--41},
publisher = {Association for Computational Linguistics},
address = {Seattle, Washington},
abstract = {We describe a methodology to extract with finer accuracy word order patterns from texts automatically annotated with Universal Dependency (UD) trained parsers. We use the methodology to quantify the word order entropy of determiners, quantifiers and numerals in ten Indo-European languages, using UD-parsed texts from a parallel corpus of prosaic texts. Our results suggest that the combination of different UD annotation layers, such as UD Relations, Universal Parts of Speech and lemma, and the introduction of language-specific lists of closed-category lemmata have the two-fold effect of improving the quality of analysis and unveiling hidden areas of variability in word order patterns.},
pubstate = {published},
type = {inproceedings}
}

Project:   C7

Talamo, Luigi; Verkerk, Annemarie

A new methodology for an old problem: A corpus-based typology of adnominal word order in European languages Journal Article

Italian Journal of Linguistics, 34, pp. 171-226, 2022.

Linguistic typology is generally characterized by strong data reduction, stemming from the use of binary or categorical classifications. Examples are the categories commonly used in describing word order: adjective-noun vs noun-adjective; genitive-noun vs noun-genitive; etc. Token-based typology is part of an answer towards more fine-grained and appropriate measurement in typology. We discuss an implementation of this methodology and provide a case-study involving adnominal word order in a sample of eleven European languages, using a parallel corpus automatically parsed with models from the Universal Dependencies project. By quantifying adnominal word order variability in terms of Shannon’s entropy, we find that the placement of certain nominal modifiers in relation to their head noun is more variable than reported by typological databases, both within and across language genera. Whereas the low variability of placement of articles, adpositions and relative clauses is generally confirmed by our findings, the adnominal ordering of demonstratives and adjectives is more variable than previously reported.

@article{article,
title = {A new methodology for an old problem: A corpus-based typology of adnominal word order in European languages},
author = {Luigi Talamo and Annemarie Verkerk},
url = {https://www.italian-journal-linguistics.com/app/uploads/2023/01/8-Talamo.pdf},
doi = {10.26346/1120-2726-197},
year = {2022},
date = {2022},
journal = {Italian Journal of Linguistics},
pages = {171--226},
volume = {34},
abstract = {Linguistic typology is generally characterized by strong data reduction, stemming from the use of binary or categorical classifications. Examples are the categories commonly used in describing word order: adjective-noun vs noun-adjective; genitive-noun vs noun-genitive; etc. Token-based typology is part of an answer towards more fine-grained and appropriate measurement in typology. We discuss an implementation of this methodology and provide a case-study involving adnominal word order in a sample of eleven European languages, using a parallel corpus automatically parsed with models from the Universal Dependencies project. By quantifying adnominal word order variability in terms of Shannon's entropy, we find that the placement of certain nominal modifiers in relation to their head noun is more variable than reported by typological databases, both within and across language genera. Whereas the low variability of placement of articles, adpositions and relative clauses is generally confirmed by our findings, the adnominal ordering of demonstratives and adjectives is more variable than previously reported.},
pubstate = {published},
type = {article}
}

Project:   C7
