Publications

Degaetano-Ortlieb, Stefania; Säily, Tanja; Bizzoni, Yuri

Registerial Adaptation vs. Innovation Across Situational Contexts: 18th Century Women in Transition Journal Article

Frontiers in Artificial Intelligence, section Language and Computation, 4, 2021.

Endeavors to computationally model language variation and change are ever increasing. While analyses of recent diachronic trends are frequently conducted, long-term trends accounting for sociolinguistic variation are less well-studied. Our work sheds light on the temporal dynamics of language use of British 18th century women as a group in transition across two situational contexts. Our findings reveal that in formal contexts women adapt to register conventions, while in informal contexts they act as innovators of change in language use influencing others. While adopted from other disciplines, our methods inform (historical) sociolinguistic work in novel ways. These methods include diachronic periodization by Kullback-Leibler divergence to determine periods of change and relevant features of variation, and event cascades as influencer models.

@article{Degaetano-Ortlieb2021,
title = {Registerial Adaptation vs. Innovation Across Situational Contexts: 18th Century Women in Transition},
author = {Stefania Degaetano-Ortlieb and Tanja S{\"a}ily and Yuri Bizzoni},
url = {https://www.frontiersin.org/article/10.3389/frai.2021.609970},
doi = {https://doi.org/10.3389/frai.2021.609970},
year = {2021},
date = {2021},
journal = {Frontiers in Artificial Intelligence, section Language and Computation},
volume = {4},
abstract = {Endeavors to computationally model language variation and change are ever increasing. While analyses of recent diachronic trends are frequently conducted, long-term trends accounting for sociolinguistic variation are less well-studied. Our work sheds light on the temporal dynamics of language use of British 18th century women as a group in transition across two situational contexts. Our findings reveal that in formal contexts women adapt to register conventions, while in informal contexts they act as innovators of change in language use influencing others. While adopted from other disciplines, our methods inform (historical) sociolinguistic work in novel ways. These methods include diachronic periodization by Kullback-Leibler divergence to determine periods of change and relevant features of variation, and event cascades as influencer models.},
pubstate = {published},
type = {article}
}

Project:   B1

Krielke, Marie-Pauline

Relativizers as markers of grammatical complexity: A diachronic, cross-register study of English and German Journal Article

Bergen Language and Linguistics Studies, 11, pp. 91-120, 2021.

In this paper, we investigate grammatical complexity as a register feature of scientific English and German. Specifically, we carry out a diachronic comparison between general and scientific discourse in the two languages from the 17th to the 19th century, using relativizers as proxies for grammatical complexity. We ground our study in register theory (Halliday and Hasan, 1985), assuming that language use reflects contextual factors, which contribute to the formation of registers (Quirk et al., 1985; Biber et al., 1999; Teich et al., 2016). Our findings show a clear tendency towards grammatical simplification in scientific discourse in both languages with English spearheading the trend early on and German following later.

@article{Krielke2021relativizers,
title = {Relativizers as markers of grammatical complexity: A diachronic, cross-register study of English and German},
author = {Marie-Pauline Krielke},
url = {https://doi.org/10.15845/bells.v11i1.3440},
doi = {https://doi.org/10.15845/bells.v11i1.3440},
year = {2021},
date = {2021-09-15},
journal = {Bergen Language and Linguistics Studies},
pages = {91-120},
volume = {11},
number = {1},
abstract = {In this paper, we investigate grammatical complexity as a register feature of scientific English and German. Specifically, we carry out a diachronic comparison between general and scientific discourse in the two languages from the 17th to the 19th century, using relativizers as proxies for grammatical complexity. We ground our study in register theory (Halliday and Hasan, 1985), assuming that language use reflects contextual factors, which contribute to the formation of registers (Quirk et al., 1985; Biber et al., 1999; Teich et al., 2016). Our findings show a clear tendency towards grammatical simplification in scientific discourse in both languages with English spearheading the trend early on and German following later.},
pubstate = {published},
type = {article}
}

Project:   B1

Menzel, Katrin; Knappen, Jörg; Teich, Elke

Generating linguistically relevant metadata for the Royal Society Corpus Journal Article

Säily, Tanja; Tyrkkö, Jukka (Ed.): Research in Corpus Linguistics, Challenges in combining structured and unstructured data in corpus development (special issue), 9, pp. 1-18, 2021, ISSN 2243-4712.

This paper provides an overview of metadata generation and management for the Royal Society Corpus (RSC), aiming to encourage discussion about the specific challenges in building substantial diachronic corpora intended to be used for linguistic and humanistic analysis. We discuss the motivations and goals of building the corpus, describe its composition and present the types of metadata it contains. Specifically, we tackle two challenges: first, integration of original metadata from the data providers (JSTOR and the Royal Society); second, derivation of additional linguistically relevant metadata regarding text structure and situational context (register).

@article{Menzel2021,
title = {Generating linguistically relevant metadata for the Royal Society Corpus},
author = {Katrin Menzel and J{\"o}rg Knappen and Elke Teich},
editor = {Tanja S{\"a}ily and Jukka Tyrkk{\"o}},
url = {https://ricl.aelinco.es/index.php/ricl/article/view/158},
doi = {https://doi.org/10.32714/ricl.09.01.02},
year = {2021},
date = {2021},
journal = {Research in Corpus Linguistics, Challenges in combining structured and unstructured data in corpus development (special issue)},
pages = {1-18},
volume = {9},
number = {1},
abstract = {This paper provides an overview of metadata generation and management for the Royal Society Corpus (RSC), aiming to encourage discussion about the specific challenges in building substantial diachronic corpora intended to be used for linguistic and humanistic analysis. We discuss the motivations and goals of building the corpus, describe its composition and present the types of metadata it contains. Specifically, we tackle two challenges: first, integration of original metadata from the data providers (JSTOR and the Royal Society); second, derivation of additional linguistically relevant metadata regarding text structure and situational context (register).},
pubstate = {published},
type = {article}
}

Project:   B1

Teich, Elke; Fankhauser, Peter; Degaetano-Ortlieb, Stefania; Bizzoni, Yuri

Less is More/More Diverse: On The Communicative Utility of Linguistic Conventionalization Journal Article

Benítez-Burraco, Antonio (Ed.): Frontiers in Communication, section Language Sciences, 2021.

We present empirical evidence of the communicative utility of CONVENTIONALIZATION, i.e., convergence in linguistic usage over time, and DIVERSIFICATION, i.e., linguistic items acquiring different, more specific usages/meanings. From a diachronic perspective, conventionalization plays a crucial role in language change as a condition for innovation and grammaticalization (Bybee, 2010; Schmid, 2015) and diversification is a cornerstone in the formation of sublanguages/registers, i.e., functional linguistic varieties (Halliday, 1988; Harris, 1991). While it is widely acknowledged that change in language use is primarily socio-culturally determined pushing towards greater linguistic expressivity, we here highlight the limiting function of communicative factors on diachronic linguistic variation showing that conventionalization and diversification are associated with a reduction of linguistic variability. To be able to observe effects of linguistic variability reduction, we first need a well-defined notion of choice in context. Linguistically, this implies the paradigmatic axis of linguistic organization, i.e., the sets of linguistic options available in a given or similar syntagmatic contexts. Here, we draw on word embeddings, weakly neural distributional language models that have recently been employed to model lexical-semantic change and allow us to approximate the notion of paradigm by neighbourhood in vector space. Second, we need to capture changes in paradigmatic variability, i.e., reduction/expansion of linguistic options in a given context. As a formal index of paradigmatic variability we use entropy, which measures the contribution of linguistic units (e.g., words) in predicting linguistic choice in bits of information.
Using entropy provides us with a link to a communicative interpretation, as it is a well-established measure of communicative efficiency with implications for cognitive processing (Linzen and Jaeger, 2016; Venhuizen et al., 2019); also, entropy is negatively correlated with distance in (word embedding) spaces which in turn shows cognitive reflexes in certain language processing tasks (Mitchel et al., 2008; Auguste et al., 2017). In terms of domain we focus on science, looking at the diachronic development of scientific English from the 17th century to modern time. This provides us with a fairly constrained yet dynamic domain of discourse that has witnessed a powerful systematization throughout the centuries and developed specific linguistic conventions geared towards efficient communication. Overall, our study confirms the assumed trends of conventionalization and diversification shown by diachronically decreasing entropy, interspersed with local, temporary entropy highs pointing to phases of linguistic expansion pertaining primarily to introduction of new technical terminology.

@article{Teich2021,
title = {Less is More/More Diverse: On The Communicative Utility of Linguistic Conventionalization},
author = {Elke Teich and Peter Fankhauser and Stefania Degaetano-Ortlieb and Yuri Bizzoni},
editor = {Antonio Ben{\'i}tez-Burraco},
url = {https://www.frontiersin.org/articles/10.3389/fcomm.2020.620275/full},
doi = {https://doi.org/10.3389/fcomm.2020.620275},
year = {2021},
date = {2021-01-26},
journal = {Frontiers in Communication, section Language Sciences},
abstract = {We present empirical evidence of the communicative utility of CONVENTIONALIZATION, i.e., convergence in linguistic usage over time, and DIVERSIFICATION, i.e., linguistic items acquiring different, more specific usages/meanings. From a diachronic perspective, conventionalization plays a crucial role in language change as a condition for innovation and grammaticalization (Bybee, 2010; Schmid, 2015) and diversification is a cornerstone in the formation of sublanguages/registers, i.e., functional linguistic varieties (Halliday, 1988; Harris, 1991). While it is widely acknowledged that change in language use is primarily socio-culturally determined pushing towards greater linguistic expressivity, we here highlight the limiting function of communicative factors on diachronic linguistic variation showing that conventionalization and diversification are associated with a reduction of linguistic variability. To be able to observe effects of linguistic variability reduction, we first need a well-defined notion of choice in context. Linguistically, this implies the paradigmatic axis of linguistic organization, i.e., the sets of linguistic options available in a given or similar syntagmatic contexts. Here, we draw on word embeddings, weakly neural distributional language models that have recently been employed to model lexical-semantic change and allow us to approximate the notion of paradigm by neighbourhood in vector space. Second, we need to capture changes in paradigmatic variability, i.e., reduction/expansion of linguistic options in a given context. As a formal index of paradigmatic variability we use entropy, which measures the contribution of linguistic units (e.g., words) in predicting linguistic choice in bits of information.
Using entropy provides us with a link to a communicative interpretation, as it is a well-established measure of communicative efficiency with implications for cognitive processing (Linzen and Jaeger, 2016; Venhuizen et al., 2019); also, entropy is negatively correlated with distance in (word embedding) spaces which in turn shows cognitive reflexes in certain language processing tasks (Mitchel et al., 2008; Auguste et al., 2017). In terms of domain we focus on science, looking at the diachronic development of scientific English from the 17th century to modern time. This provides us with a fairly constrained yet dynamic domain of discourse that has witnessed a powerful systematization throughout the centuries and developed specific linguistic conventions geared towards efficient communication. Overall, our study confirms the assumed trends of conventionalization and diversification shown by diachronically decreasing entropy, interspersed with local, temporary entropy highs pointing to phases of linguistic expansion pertaining primarily to introduction of new technical terminology.},
pubstate = {published},
type = {article}
}

Project:   B1

Mosbach, Marius; Degaetano-Ortlieb, Stefania; Krielke, Marie-Pauline; Abdullah, Badr M.; Klakow, Dietrich

A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English Inproceedings

Proceedings of the 28th International Conference on Computational Linguistics, pp. 771-787, 2020.

Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a) model comparison in evaluation task and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.

@inproceedings{Mosbach2020,
title = {A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English},
author = {Marius Mosbach and Stefania Degaetano-Ortlieb and Marie-Pauline Krielke and Badr M. Abdullah and Dietrich Klakow},
url = {https://aclanthology.org/2020.coling-main.67/},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
pages = {771-787},
abstract = {Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses especially on semantic knowledge, strongly impacting models’ performance. Our results highlight the importance of (a) model comparison in evaluation task and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B1 B4 C4

Juzek, Tom; Krielke, Marie-Pauline; Teich, Elke

Exploring diachronic syntactic shifts with dependency length: the case of scientific English Inproceedings

Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020), Association for Computational Linguistics, pp. 109-119, Barcelona, Spain (Online), 2020.

We report on an application of universal dependencies for the study of diachronic shifts in syntactic usage patterns. Our focus is on the evolution of Scientific English in the Late Modern English period (ca. 1700-1900). Our data set is the Royal Society Corpus (RSC), comprising the full set of publications of the Royal Society of London between 1665 and 1996. Our starting assumption is that over time, Scientific English develops specific syntactic choice preferences that increase efficiency in (expert-to-expert) communication. The specific hypothesis we pursue in this paper is that changing syntactic choice preferences lead to greater dependency locality/dependency length minimization, which is associated with positive effects for the efficiency of human as well as computational linguistic processing. As a basis for our measurements, we parsed the RSC using Stanford CoreNLP. Overall, we observe a decrease in dependency length, with long dependency structures becoming less frequent and short dependency structures becoming more frequent over time, notably pertaining to the nominal phrase, thus marking an overall push towards greater communicative efficiency.

@inproceedings{juzek-etal-2020-exploring,
title = {Exploring diachronic syntactic shifts with dependency length: the case of scientific English},
author = {Tom Juzek and Marie-Pauline Krielke and Elke Teich},
url = {https://www.aclweb.org/anthology/2020.udw-1.13},
year = {2020},
date = {2020},
booktitle = {Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)},
pages = {109-119},
publisher = {Association for Computational Linguistics},
address = {Barcelona, Spain (Online)},
abstract = {We report on an application of universal dependencies for the study of diachronic shifts in syntactic usage patterns. Our focus is on the evolution of Scientific English in the Late Modern English period (ca. 1700-1900). Our data set is the Royal Society Corpus (RSC), comprising the full set of publications of the Royal Society of London between 1665 and 1996. Our starting assumption is that over time, Scientific English develops specific syntactic choice preferences that increase efficiency in (expert-to-expert) communication. The specific hypothesis we pursue in this paper is that changing syntactic choice preferences lead to greater dependency locality/dependency length minimization, which is associated with positive effects for the efficiency of human as well as computational linguistic processing. As a basis for our measurements, we parsed the RSC using Stanford CoreNLP. Overall, we observe a decrease in dependency length, with long dependency structures becoming less frequent and short dependency structures becoming more frequent over time, notably pertaining to the nominal phrase, thus marking an overall push towards greater communicative efficiency.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Teich, Elke

Language variation and change: A communicative perspective Miscellaneous

Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft, DGfS 2020, Hamburg, 2020.

It is widely acknowledged that language use and language structure are closely interlinked, linguistic structure emerging from language use (Bybee & Hopper 2001). Language use, in turn, is characterized by variation; in fact, speakers’ ability to adapt to changing contexts is a prerequisite for language to be functional (Weinreich et al. 1968).

Taking the perspective of rational communication, in my talk I will revisit some core questions of diachronic linguistic change: Why does a change happen? Which features are involved in change? How does change proceed? What are the effects of change? Recent work on online human language use reveals that speakers try to optimize their linguistic productions by encoding their messages with uniform information density (see Crocker et al. 2016 for an overview). Here, a major determinant in linguistic choice is predictability in context. Predictability in context is commonly represented by information content measured in bits (Shannon information): The more predictable a linguistic unit (e.g. word) is in a given context, the fewer bits are needed for encoding and the shorter its linguistic encoding may be (and vice versa, the more “surprising” a unit is in a given context, the more bits are needed for encoding and the more explicit its encoding tends to be). In this view, one major function of linguistic variation is to modulate information content so as to optimize message transmission.

In my talk, I apply this perspective to diachronic linguistic change. I show that speakers’ continuous adaptation to changing contextual conditions pushes towards linguistic innovation and results in temporary, high levels of expressivity, but the concern for maintaining communicative function pulls towards convergence and results in conventionalization. The diachronic scenario I discuss is mid-term change (200–250 years) in English in the late Modern period, focusing on the discourse domain of science (Degaetano-Ortlieb & Teich 2019). In terms of methods, I use computational language models to estimate predictability in context; and to assess diachronic change, I apply selected measures of information content, including entropy and surprisal.

@miscellaneous{Teich2020a,
title = {Language variation and change: A communicative perspective},
author = {Elke Teich},
url = {https://www.zfs.uni-hamburg.de/en/dgfs2020/programm/keynotes/elke-teich.html},
year = {2020},
date = {2020-11-04},
booktitle = {Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft, DGfS 2020},
address = {Hamburg},
abstract = {It is widely acknowledged that language use and language structure are closely interlinked, linguistic structure emerging from language use (Bybee & Hopper 2001). Language use, in turn, is characterized by variation; in fact, speakers’ ability to adapt to changing contexts is a prerequisite for language to be functional (Weinreich et al. 1968). Taking the perspective of rational communication, in my talk I will revisit some core questions of diachronic linguistic change: Why does a change happen? Which features are involved in change? How does change proceed? What are the effects of change? Recent work on online human language use reveals that speakers try to optimize their linguistic productions by encoding their messages with uniform information density (see Crocker et al. 2016 for an overview). Here, a major determinant in linguistic choice is predictability in context. Predictability in context is commonly represented by information content measured in bits (Shannon information): The more predictable a linguistic unit (e.g. word) is in a given context, the fewer bits are needed for encoding and the shorter its linguistic encoding may be (and vice versa, the more “surprising” a unit is in a given context, the more bits are needed for encoding and the more explicit its encoding tends to be). In this view, one major function of linguistic variation is to modulate information content so as to optimize message transmission. In my talk, I apply this perspective to diachronic linguistic change. I show that speakers’ continuous adaptation to changing contextual conditions pushes towards linguistic innovation and results in temporary, high levels of expressivity, but the concern for maintaining communicative function pulls towards convergence and results in conventionalization.
The diachronic scenario I discuss is mid-term change (200–250 years) in English in the late Modern period, focusing on the discourse domain of science (Degaetano-Ortlieb & Teich 2019). In terms of methods, I use computational language models to estimate predictability in context; and to assess diachronic change, I apply selected measures of information content, including entropy and surprisal.},
note = {Keynote},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Fischer, Stefan; Knappen, Jörg; Menzel, Katrin; Teich, Elke

The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study Inproceedings

Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, pp. 794-802, Marseille, France, 2020.

We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings.

The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases.

We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.

@inproceedings{fischer-EtAl:2020:LREC,
title = {The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
author = {Stefan Fischer and J{\"o}rg Knappen and Katrin Menzel and Elke Teich},
url = {https://www.aclweb.org/anthology/2020.lrec-1.99/},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
pages = {794-802},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings. The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases. We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Fankhauser, Peter; Teich, Elke

Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach Journal Article

Jurgens, David (Ed.): Frontiers in Artificial Intelligence, section Language and Computation, 2020.

We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665.

Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register).

Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.

@article{Bizzoni2020b,
title = {Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Peter Fankhauser and Elke Teich},
editor = {David Jurgens},
url = {https://www.frontiersin.org/articles/10.3389/frai.2020.00073/full},
doi = {https://doi.org/10.3389/frai.2020.00073},
year = {2020},
date = {2020-10-18},
journal = {Frontiers in Artificial Intelligence, section Language and Computation},
abstract = {We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.},
pubstate = {published},
type = {article}
}

Project:   B1

Juzek, Tom; Fischer, Stefan; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke

Challenges of parsing a historical corpus of Scientific English Miscellaneous

Historical Corpora and Variation (Book of Abstracts), Cagliari, Italy, 2019.

In this contribution, we outline our experiences with syntactically parsing a diachronic historical corpus. We report on how errors like OCR inaccuracies, end-of-sentence inaccuracies, etc. propagate bottom-up and how we approach such errors by building on existing machine learning approaches for error correction. The Royal Society Corpus (RSC; Kermes et al. 2016) is a collection of scientific text from 1665 to 1869 and contains ca. 10 000 documents and 30 million tokens. Using the RSC, we wish to describe and model how syntactic complexity changes as Scientific English of the late modern period develops. Our focus is on how common measures of syntactic complexity, e.g. length in tokens, embedding depth, and number of dependants, relate to estimates of information content. Our hypothesis is that Scientific English develops towards the use of shorter sentences with fewer clausal embeddings and increasingly complex noun phrases over time, in order to accommodate an expansion on the lexical level.

@miscellaneous{Juzek2019a,
title = {Challenges of parsing a historical corpus of Scientific English},
author = {Tom Juzek and Stefan Fischer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://convegni.unica.it/hicov/files/2019/01/Juzek-et-al.pdf},
year = {2019},
date = {2019},
booktitle = {Historical Corpora and Variation (Book of Abstracts)},
address = {Cagliari, Italy},
abstract = {In this contribution, we outline our experiences with syntactically parsing a diachronic historical corpus. We report on how errors like OCR inaccuracies, end-of-sentence inaccuracies, etc. propagate bottom-up and how we approach such errors by building on existing machine learning approaches for error correction. The Royal Society Corpus (RSC; Kermes et al. 2016) is a collection of scientific text from 1665 to 1869 and contains ca. 10 000 documents and 30 million tokens. Using the RSC, we wish to describe and model how syntactic complexity changes as Scientific English of the late modern period develops. Our focus is on how common measures of syntactic complexity, e.g. length in tokens, embedding depth, and number of dependants, relate to estimates of information content. Our hypothesis is that Scientific English develops towards the use of shorter sentences with fewer clausal embeddings and increasingly complex noun phrases over time, in order to accommodate an expansion on the lexical level.},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Juzek, Tom; Fischer, Stefan; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke

Annotation quality assessment and error correction in diachronic corpora: Combining pattern-based and machine learning approaches Miscellaneous

52nd Annual Meeting of the Societas Linguistica Europaea (Book of Abstracts), 2019.

@miscellaneous{Juzek2019,
title = {Annotation quality assessment and error correction in diachronic corpora: Combining pattern-based and machine learning approaches},
author = {Tom Juzek and Stefan Fischer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich},
year = {2019},
date = {2019},
booktitle = {52nd Annual Meeting of the Societas Linguistica Europaea (Book of Abstracts)},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

Typical linguistic patterns of English history texts from the eighteenth to the nineteenth century Book Chapter

Moskowich, Isabel; Crespo, Begoña; Puente-Castelo, Luis; Maria Monaco, Leida (Ed.): Writing History in Late Modern English: Explorations of the Coruña Corpus, John Benjamins, pp. 58-81, Amsterdam, 2019.

@inbook{Degaetano-Ortlieb2019b,
title = {Typical linguistic patterns of English history texts from the eighteenth to the nineteenth century},
author = {Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
editor = {Isabel Moskowich and Bego{\~n}a Crespo and Luis Puente-Castelo and Leida Maria Monaco},
url = {https://benjamins.com/catalog/z.225.04deg},
year = {2019},
date = {2019},
booktitle = {Writing History in Late Modern English: Explorations of the Coru{\~n}a Corpus},
pages = {58-81},
publisher = {John Benjamins},
address = {Amsterdam},
pubstate = {published},
type = {inbook}
}

Project:   B1

Krielke, Marie-Pauline; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Teich, Elke

System and use of wh-relativizers in 200 years of English scientific writing Miscellaneous

10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

We investigate the diachronic development of wh-relativizers in English scientific writing in the late modern period, characterized by an initially richly populated paradigm in the late 17th/early 18th century and a reduction to only a few options by the mid 19th century. To explain this reduction, we take the perspective of rational communication, according to which language users, while striving for successful communication, seek to reduce their effort. Previous work has shown that production effort is directly linked to the number of options at a given choice point (Milin et al. 2009, Linzen and Jaeger 2016). This effort is appropriately indexed by entropy: The more options with equal/similar probability, the higher the entropy, i.e. the higher the production effort. Similarly, processing effort is correlated with predictability in context – surprisal (Levy 2008). Highly predictable, conventionalized patterns are easier to produce and comprehend than less predictable ones. Assuming that language users strive for ease in communication, diachronically they are likely to (a) develop a preference for which options to use and discard others to reduce entropy, and (b) converge on how to use those options to reduce surprisal. We test this for the changing use of wh-relativizers in scientific text in the late modern period. Many scholars have investigated variation in relativizer choice in standard spoken and written varieties (e.g. Guy and Bayley 1995; Biber et al. 1999; Lehmann 2001; Hinrichs et al. 2015), in vernacular speech (e.g. Romaine 1982, Tottie and Harvie 2000; Tagliamonte 2002; Tagliamonte et al. 2005; Levey 2006), and from synchronic and diachronic perspectives (e.g. Romaine 1980; Ball 1996; Hundt et al. 2012; Nevalainen 2012, Nevalainen and Raumolin-Brunberg 2002). While stylistic variability of the different options in written present day English is well known (see Biber et al. 1999; Leech et al. 2009), we know little about the diachronic development of relativizers according to register, e.g. in scientific writing. Also, most research only considers most common relativizers (e.g. which, that, zero) still in use in present day English. Here, we study a more comprehensive set of relativizers across scientific and “general language” (mix of registers) from a diachronic perspective. Possible paradigmatic change is analyzed by diachronic word embeddings (cf. Fankhauser and Kupietz 2017), allowing us to select items affected by change. Then we assess the change (reduction/expansion) of a paradigm estimating its entropy over time. To check whether changes are specific to scientific language, we compare with uses in general language. Finally, we inspect possible changes in the predictability of selected wh-relativizers involved in paradigmatic change estimating their surprisal over time, looking for traces of conventionalization (cf. Degaetano-Ortlieb and Teich 2016, 2018).

@miscellaneous{Krielke2019b,
title = {System and use of wh-relativizers in 200 years of English scientific writing},
author = {Marie-Pauline Krielke and Stefan Fischer and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/cl2019_paper_266.pdf},
year = {2019},
date = {2019},
booktitle = {10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {We investigate the diachronic development of wh-relativizers in English scientific writing in the late modern period, characterized by an initially richly populated paradigm in the late 17th/early 18th century and a reduction to only a few options by the mid 19th century. To explain this reduction, we take the perspective of rational communication, according to which language users, while striving for successful communication, seek to reduce their effort. Previous work has shown that production effort is directly linked to the number of options at a given choice point (Milin et al. 2009, Linzen and Jaeger 2016). This effort is appropriately indexed by entropy: The more options with equal/similar probability, the higher the entropy, i.e. the higher the production effort. Similarly, processing effort is correlated with predictability in context – surprisal (Levy 2008). Highly predictable, conventionalized patterns are easier to produce and comprehend than less predictable ones. Assuming that language users strive for ease in communication, diachronically they are likely to (a) develop a preference for which options to use and discard others to reduce entropy, and (b) converge on how to use those options to reduce surprisal. We test this for the changing use of wh-relativizers in scientific text in the late modern period. Many scholars have investigated variation in relativizer choice in standard spoken and written varieties (e.g. Guy and Bayley 1995; Biber et al. 1999; Lehmann 2001; Hinrichs et al. 2015), in vernacular speech (e.g. Romaine 1982, Tottie and Harvie 2000; Tagliamonte 2002; Tagliamonte et al. 2005; Levey 2006), and from synchronic and diachronic perspectives (e.g. Romaine 1980; Ball 1996; Hundt et al. 2012; Nevalainen 2012, Nevalainen and Raumolin-Brunberg 2002). While stylistic variability of the different options in written present day English is well known (see Biber et al. 1999; Leech et al. 2009), we know little about the diachronic development of relativizers according to register, e.g. in scientific writing. Also, most research only considers most common relativizers (e.g. which, that, zero) still in use in present day English. Here, we study a more comprehensive set of relativizers across scientific and “general language” (mix of registers) from a diachronic perspective. Possible paradigmatic change is analyzed by diachronic word embeddings (cf. Fankhauser and Kupietz 2017), allowing us to select items affected by change. Then we assess the change (reduction/expansion) of a paradigm estimating its entropy over time. To check whether changes are specific to scientific language, we compare with uses in general language. Finally, we inspect possible changes in the predictability of selected wh-relativizers involved in paradigmatic change estimating their surprisal over time, looking for traces of conventionalization (cf. Degaetano-Ortlieb and Teich 2016, 2018).},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Krielke, Marie-Pauline; Scheurer, Franziska; Teich, Elke

A diachronic perspective on efficiency in language use: that-complement clause in academic writing across 300 years Inproceedings

Proceedings of the 10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

Efficiency in language use and the role of predictability in context have attracted many researchers from different fields (Zipf 1949; Landau 1969; Fidelholtz 1975, Jurafsky et al. 1998; Bybee and Scheibman 1999; Genzel and Charniak 2002; Aylett and Turk 2004; Hawkins 2004; Piantadosi et al. 2009, Jaeger 2010). The analysis of reduction processes, where linguistic units are reduced/omitted has enhanced our knowledge on efficiency in communication. Possible factors affecting retention or omission of an optional element include discourse context (cf. Thompson and Mulac 1991), the amount of information a unit transmits given its context (known as surprisal, cf. Jaeger 2010) or the complexity of the syntagmatic environment (Rohdenburg 1998). So far, the role change in language use plays has been less considered.

@inproceedings{Degaetano-Ortlieb2019b,
title = {A diachronic perspective on efficiency in language use: that-complement clause in academic writing across 300 years},
author = {Stefania Degaetano-Ortlieb and Marie-Pauline Krielke and Franziska Scheurer and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/abstract_that-comp_final.pdf},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {Efficiency in language use and the role of predictability in context have attracted many researchers from different fields (Zipf 1949; Landau 1969; Fidelholtz 1975, Jurafsky et al. 1998; Bybee and Scheibman 1999; Genzel and Charniak 2002; Aylett and Turk 2004; Hawkins 2004; Piantadosi et al. 2009, Jaeger 2010). The analysis of reduction processes, where linguistic units are reduced/omitted has enhanced our knowledge on efficiency in communication. Possible factors affecting retention or omission of an optional element include discourse context (cf. Thompson and Mulac 1991), the amount of information a unit transmits given its context (known as surprisal, cf. Jaeger 2010) or the complexity of the syntagmatic environment (Rohdenburg 1998). So far, the role change in language use plays has been less considered.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania

Hybridization effects in literary texts Inproceedings

Proceedings of the 10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

We present an analysis of subregisters, whose differentiation is still a difficult task due to their hybridity reflected in conforming to a presumed “norm” and encompassing something “new”. We focus on texts at the interface between what Halliday (2002: 177) calls two opposite “cultures”, literature and science (here: science fiction texts). Texts belonging to one register will exhibit similar choices of lexico-grammatical features. Hybrid texts at the intersection between two registers will reflect a mixture of particular features (cf. Degaetano-Ortlieb et al. 2014, Biber et al. 2015, Teich et al. 2013, 2016, Underwood 2016). Consider example (1) taken from Mary Shelley’s Frankenstein. While traditionally grounded as a literary text, it shows a registerial nuance from the influential register of science. This encompasses phrases (bold) also found in scientific articles from that period (e.g. in the Royal Society Corpus, cf. Kermes et al. 2016), verbs related to scientific endeavor (e.g. become acquainted, examine, observe, discover), and scientific terminology (e.g. anatomy, decay, corruption, vertebrae, inflammable air) packed into complex nominal phrases (underlined). Note that features marking this registerial nuance include not only lexical but also grammatical features.

(1) I became acquainted with the science of anatomy, but this was not sufficient; I must also observe the natural decay and corruption of the human body. […] Now I was led to examine the cause and progress of this decay. I succeeded in discovering the cause of generation and life. (Frankenstein, Mary Shelley, 1818/1823).

Thus, we hypothesize that hybrid registers while mainly resembling their traditional register in the use of lexico-grammatical features (H1 register resemblance), will also show particular lexico-grammatical nuances of their influential register (H2 registerial nuance). In particular, we are interested in (a) variation across registers to see which lexico-grammatical features are involved in hybridization effects and (b) intra-textual variation (e.g. across chapters) to analyze in which parts of a text hybridization effects are most prominent.

@inproceedings{Degaetano-Ortlieb2019b,
title = {Hybridization effects in literary texts},
author = {Stefania Degaetano-Ortlieb},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/abstact_cl2019_hybridization_final.pdf},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {We present an analysis of subregisters, whose differentiation is still a difficult task due to their hybridity reflected in conforming to a presumed “norm” and encompassing something “new”. We focus on texts at the interface between what Halliday (2002: 177) calls two opposite “cultures”, literature and science (here: science fiction texts). Texts belonging to one register will exhibit similar choices of lexico-grammatical features. Hybrid texts at the intersection between two registers will reflect a mixture of particular features (cf. Degaetano-Ortlieb et al. 2014, Biber et al. 2015, Teich et al. 2013, 2016, Underwood 2016). Consider example (1) taken from Mary Shelley’s Frankenstein. While traditionally grounded as a literary text, it shows a registerial nuance from the influential register of science. This encompasses phrases (bold) also found in scientific articles from that period (e.g. in the Royal Society Corpus, cf. Kermes et al. 2016), verbs related to scientific endeavor (e.g. become acquainted, examine, observe, discover), and scientific terminology (e.g. anatomy, decay, corruption, vertebrae, inflammable air) packed into complex nominal phrases (underlined). Note that features marking this registerial nuance include not only lexical but also grammatical features. (1) I became acquainted with the science of anatomy, but this was not sufficient; I must also observe the natural decay and corruption of the human body. […] Now I was led to examine the cause and progress of this decay. I succeeded in discovering the cause of generation and life. (Frankenstein, Mary Shelley, 1818/1823). Thus, we hypothesize that hybrid registers while mainly resembling their traditional register in the use of lexico-grammatical features (H1 register resemblance), will also show particular lexico-grammatical nuances of their influential register (H2 registerial nuance). In particular, we are interested in (a) variation across registers to see which lexico-grammatical features are involved in hybridization effects and (b) intra-textual variation (e.g. across chapters) to analyze in which parts of a text hybridization effects are most prominent.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Piper, Andrew

The Scientization of Literary Study Inproceedings

Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at NAACL 2019, Association for Computational Linguistics, pp. 18-28, Minneapolis, MN, USA, 2019.

Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).

@inproceedings{degaetano-ortlieb-piper-2019-scientization,
title = {The Scientization of Literary Study},
author = {Stefania Degaetano-Ortlieb and Andrew Piper},
url = {https://aclanthology.org/W19-2503},
doi = {https://doi.org/10.18653/v1/W19-2503},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at NAACL 2019},
pages = {18-28},
publisher = {Association for Computational Linguistics},
address = {Minneapolis, MN, USA},
abstract = {Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Toward an optimal code for communication: the case of scientific English Journal Article

Corpus Linguistics and Linguistic Theory, 18, pp. 1-33, 2019.

We present a model of the linguistic development of scientific English from the mid-seventeenth to the late-nineteenth century, a period that witnessed significant political and social changes, including the evolution of modern science. There is a wealth of descriptive accounts of scientific English, both from a synchronic and a diachronic perspective, but only few attempts at a unified explanation of its evolution. The explanation we offer here is a communicative one: while external pressures (specialization, diversification) push for an increase in expressivity, communicative concerns pull toward convergence on particular options (conventionalization). What emerges over time is a code which is optimized for written, specialist communication, relying on specific linguistic means to modulate information content. As we show, this is achieved by the systematic interplay between lexis and grammar. The corpora we employ are the Royal Society Corpus (RSC) and for comparative purposes, the Corpus of Late Modern English (CLMET). We build various diachronic, computational n-gram language models of these corpora and then apply formal measures of information content (here: relative entropy and surprisal) to detect the linguistic features significantly contributing to diachronic change, estimate the (changing) level of information of features and capture the time course of change.

@article{Degaetano-Ortlieb2019b,
title = {Toward an optimal code for communication: the case of scientific English},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://www.degruyter.com/document/doi/10.1515/cllt-2018-0088/html?lang=en},
doi = {https://doi.org/10.1515/cllt-2018-0088},
year = {2019},
date = {2019},
journal = {Corpus Linguistics and Linguistic Theory},
pages = {1-33},
volume = {18},
number = {1},
abstract = {We present a model of the linguistic development of scientific English from the mid-seventeenth to the late-nineteenth century, a period that witnessed significant political and social changes, including the evolution of modern science. There is a wealth of descriptive accounts of scientific English, both from a synchronic and a diachronic perspective, but only few attempts at a unified explanation of its evolution. The explanation we offer here is a communicative one: while external pressures (specialization, diversification) push for an increase in expressivity, communicative concerns pull toward convergence on particular options (conventionalization). What emerges over time is a code which is optimized for written, specialist communication, relying on specific linguistic means to modulate information content. As we show, this is achieved by the systematic interplay between lexis and grammar. The corpora we employ are the Royal Society Corpus (RSC) and for comparative purposes, the Corpus of Late Modern English (CLMET). We build various diachronic, computational n-gram language models of these corpora and then apply formal measures of information content (here: relative entropy and surprisal) to detect the linguistic features significantly contributing to diachronic change, estimate the (changing) level of information of features and capture the time course of change.},
pubstate = {published},
type = {article}
}

Project:   B1

Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

Paradigmatic change and redistribution of functional load: The case of relative clauses in scientific English Miscellaneous

Symposium on Corpus Approaches to Lexicogrammar (Book of Abstracts), Edge Hill University, 2019.

@miscellaneous{Krielke2019,
title = {Paradigmatic change and redistribution of functional load: The case of relative clauses in scientific English},
author = {Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
year = {2019},
date = {2019},
booktitle = {Symposium on Corpus Approaches to Lexicogrammar (Book of Abstracts)},
address = {Edge Hill University},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Menzel, Katrin; Teich, Elke

Medical discourse across 300 years: insights from the Royal Society Corpus Inproceedings

2nd International Conference on Historical Medical Discourse (CHIMED-2), 2019.

@inproceedings{Menzel2019b,
title = {Medical discourse across 300 years: insights from the Royal Society Corpus},
author = {Katrin Menzel and Elke Teich},
year = {2019},
date = {2019},
booktitle = {2nd International Conference on Historical Medical Discourse (CHIMED-2)},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke; Khamis, Ashraf; Kermes, Hannah

An Information-Theoretic Approach to Modeling Diachronic Change in Scientific English Book Chapter

Suhr, Carla; Nevalainen, Terttu; Taavitsainen, Irma (Ed.): From Data to Evidence in English Language Research, Brill, pp. 258-281, Leiden, 2019.

We present an information-theoretic approach to investigate diachronic change in scientific English. Our main assumption is that over time scientific English has become increasingly dense, i.e. linguistic constructions allowing dense packing of information are progressively used. So far, diachronic change in scientific writing has been investigated by means of frequency-based approaches (see e.g. Halliday (1988); Atkinson (1998); Biber (2006b, c); Biber and Gray (2016); Banks (2008); Taavitsainen and Pahta (2010)). We use information-theoretic measures (entropy, surprisal; Shannon (1949)) to assess features previously stated to change over time and to discover new, latent features from the data itself that are involved in diachronic change. For this, we use the Royal Society Corpus (rsc) (Kermes et al. (2016)), which spans over the time period 1665 to 1869. We present three kinds of analyses: nominal compounding (typical of academic writing), modal verbs (shown to have changed in frequency over time), and an analysis based on part-of-speech trigrams to detect new features that change diachronically. We show how information-theoretic measures help to investigate, evaluate and detect features involved in diachronic change.

@inbook{Degaetano-Ortlieb2019,
title = {An Information-Theoretic Approach to Modeling Diachronic Change in Scientific English},
author = {Stefania Degaetano-Ortlieb and Elke Teich and Ashraf Khamis and Hannah Kermes},
editor = {Carla Suhr and Terttu Nevalainen and Irma Taavitsainen},
url = {https://brill.com/display/book/edcoll/9789004390652/BP000014.xml},
doi = {https://doi.org/10.1163/9789004390652},
year = {2019},
date = {2019},
booktitle = {From Data to Evidence in English Language Research},
pages = {258-281},
publisher = {Brill},
address = {Leiden},
abstract = {We present an information-theoretic approach to investigate diachronic change in scientific English. Our main assumption is that over time scientific English has become increasingly dense, i.e. linguistic constructions allowing dense packing of information are progressively used. So far, diachronic change in scientific writing has been investigated by means of frequency-based approaches (see e.g. Halliday (1988); Atkinson (1998); Biber (2006b, c); Biber and Gray (2016); Banks (2008); Taavitsainen and Pahta (2010)). We use information-theoretic measures (entropy, surprisal; Shannon (1949)) to assess features previously stated to change over time and to discover new, latent features from the data itself that are involved in diachronic change. For this, we use the Royal Society Corpus (rsc) (Kermes et al. (2016)), which spans over the time period 1665 to 1869. We present three kinds of analyses: nominal compounding (typical of academic writing), modal verbs (shown to have changed in frequency over time), and an analysis based on part-of-speech trigrams to detect new features that change diachronically. We show how information-theoretic measures help to investigate, evaluate and detect features involved in diachronic change.},
pubstate = {published},
type = {inbook}
}

Project:   B1
