Publications

Fischer, Stefan; Knappen, Jörg; Menzel, Katrin; Teich, Elke

The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study Inproceedings

Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, pp. 794-802, Marseille, France, 2020.

We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings.

The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases.

We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.

@inproceedings{fischer-EtAl:2020:LREC,
title = {The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study},
author = {Stefan Fischer and J{\"o}rg Knappen and Katrin Menzel and Elke Teich},
url = {https://www.aclweb.org/anthology/2020.lrec-1.99/},
year = {2020},
date = {2020},
booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
pages = {794-802},
publisher = {European Language Resources Association},
address = {Marseille, France},
abstract = {We present a new, extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering 300+ years of scientific writing (1665–1996). The corpus comprises 47 837 texts, primarily scientific articles, and is based on publications of the Royal Society of London, mainly its Philosophical Transactions and Proceedings. The corpus has been built on the basis of the FAIR principles and is freely available under a Creative Commons license, excluding copy-righted parts. We provide information on how the corpus can be found, the file formats available for download as well as accessibility via a web-based corpus query platform. We show a number of analytic tools that we have implemented for better usability and provide an example of use of the corpus for linguistic analysis as well as examples of subsequent, external uses of earlier releases. We place the RSC against the background of existing English diachronic/scientific corpora, elaborating on its value for linguistic and humanistic study.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Fankhauser, Peter; Teich, Elke

Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach Journal Article

Jurgens, David (Ed.): Frontiers in Artificial Intelligence, section Language and Computation, 2020.

We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665.

Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register).

Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.

@article{Bizzoni2020b,
title = {Linguistic Variation and Change in 250 years of English Scientific Writing: A Data-driven Approach},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Peter Fankhauser and Elke Teich},
editor = {David Jurgens},
url = {https://www.frontiersin.org/articles/10.3389/frai.2020.00073/full},
doi = {https://doi.org/https://doi.org/10.3389/frai.2020.00073},
year = {2020},
date = {2020-10-18},
journal = {Frontiers in Artificial Intelligence, section Language and Computation},
abstract = {We trace the evolution of Scientific English through the Late Modern period to modern time on the basis of a comprehensive corpus composed of the Transactions and Proceedings of the Royal Society of London, the first and longest-running English scientific journal established in 1665. Specifically, we explore the linguistic imprints of specialization and diversification in the science domain which accumulate in the formation of “scientific language” and field-specific sublanguages/registers (chemistry, biology etc.). We pursue an exploratory, data-driven approach using state-of-the-art computational language models and combine them with selected information-theoretic measures (entropy, relative entropy) for comparing models along relevant dimensions of variation (time, register). Focusing on selected linguistic variables (lexis, grammar), we show how we deploy computational language models for capturing linguistic variation and change and discuss benefits and limitations.},
pubstate = {published},
type = {article}
}

Copy BibTeX to Clipboard

Project:   B1

Juzek, Tom; Fischer, Stefan; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke

Challenges of parsing a historical corpus of Scientific English Miscellaneous

Historical Corpora and Variation (Book of Abstracts), Cagliari, Italy, 2019.

In this contribution, we outline our experiences with syntactically parsing a diachronic historical corpus. We report on how errors like OCR inaccuracies, end-of-sentence inaccuracies, etc. propagate bottom-up and how we approach such errors by building on existing machine learning approaches for error correction. The Royal Society Corpus (RSC; Kermes et al. 2016) is a collection of scientific text from 1665 to 1869 and contains ca. 10 000 documents and 30 million tokens. Using the RSC, we wish to describe and
model how syntactic complexity changes as Scientific English of the late modern period develops. Our focus is on how common measures of syntactic complexity, e.g. length in tokens, embedding depth, and number of dependants, relate to estimates of information content. Our hypothesis is that Scientific English develops towards the use of shorter sentences with fewer clausal embeddings and increasingly complex noun phrases over time, in order to accommodate an expansion on the lexical level.

@miscellaneous{Juzek2019a,
title = {Challenges of parsing a historical corpus of Scientific English},
author = {Tom Juzek and Stefan Fischer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://convegni.unica.it/hicov/files/2019/01/Juzek-et-al.pdf},
year = {2019},
date = {2019},
booktitle = {Historical Corpora and Variation (Book of Abstracts)},
address = {Cagliari, Italy},
abstract = {In this contribution, we outline our experiences with syntactically parsing a diachronic historical corpus. We report on how errors like OCR inaccuracies, end-of-sentence inaccuracies, etc. propagate bottom-up and how we approach such errors by building on existing machine learning approaches for error correction. The Royal Society Corpus (RSC; Kermes et al. 2016) is a collection of scientific text from 1665 to 1869 and contains ca. 10 000 documents and 30 million tokens. Using the RSC, we wish to describe and model how syntactic complexity changes as Scientific English of the late modern period develops. Our focus is on how common measures of syntactic complexity, e.g. length in tokens, embedding depth, and number of dependants, relate to estimates of information content. Our hypothesis is that Scientific English develops towards the use of shorter sentences with fewer clausal embeddings and increasingly complex noun phrases over time, in order to accommodate an expansion on the lexical level.},
pubstate = {published},
type = {miscellaneous}
}

Copy BibTeX to Clipboard

Project:   B1

Juzek, Tom; Fischer, Stefan; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke

Annotation quality assessment and error correction in diachronic corpora: Combining pattern-based and machine learning approaches Miscellaneous

52nd Annual Meeting of the Societas Linguistica Europaea (Book of Abstracts), 2019.

@miscellaneous{Juzek2019,
title = {Annotation quality assessment and error correction in diachronic corpora: Combining pattern-based and machine learning approaches},
author = {Tom Juzek and Stefan Fischer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich},
year = {2019},
date = {2019},
booktitle = {52nd Annual Meeting of the Societas Linguistica Europaea (Book of Abstracts)},
pubstate = {published},
type = {miscellaneous}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

Typical linguistic patterns of English history texts from the eighteenth to the nineteenth century Book Chapter

Moskowich, Isabel; Crespo, Begoña; Puente-Castelo, Luis; Maria Monaco, Leida (Ed.): Writing History in Late Modern English: Explorations of the Coruña Corpus, John Benjamins, pp. 58-81, Amsterdam, 2019.

@inbook{Degaetano-Ortlieb2019b,
title = {Typical linguistic patterns of English history texts from the eighteenth to the nineteenth century},
author = {Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
editor = {Isabel Moskowich and Bego{\~n}a Crespo and Luis Puente-Castelo and Leida Maria Monaco},
url = {https://benjamins.com/catalog/z.225.04deg},
year = {2019},
date = {2019},
booktitle = {Writing History in Late Modern English: Explorations of the Coru{\~n}a Corpus},
pages = {58-81},
publisher = {John Benjamins},
address = {Amsterdam},
pubstate = {published},
type = {inbook}
}

Copy BibTeX to Clipboard

Project:   B1

Krielke, Marie-Pauline; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Teich, Elke

System and use of wh-relativizers in 200 years of English scientific writing Miscellaneous

10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

We investigate the diachronic development of wh-relativizers in English scientific writing in the late modern period, characterized by an initially richly populated paradigm in the late 17th/early 18th century and a reduction to only a few options by the mid 19th century. To explain this reduction, we take the perspective of rational communication, according to which language users, while striving for successful communication, seek to reduce their effort. Previous work has shown that production effort is directly linked to the number of options at a given choice point (Milin et al. 2009, Linzen and Jaeger 2016). This effort is appropriately indexed by entropy: The more options with equal/similar probability, the higher the entropy, i.e. the higher the production effort. Similarly, processing effort is correlated with predictability in context – surprisal (Levy 2008). Highly predictable, conventionalized patterns are easier to produce and comprehend than less predictable ones. Assuming that language users strive for ease in communication, diachronically they are likely to (a) develop a preference for which options to use and discard others to reduce entropy, and (b) converge on how to use those options to reduce surprisal. We test this for the changing use of wh-relativizers in scientific text in the late modern period. Many scholars have investigated variation in relativizer choice in standard spoken and written varieties (e.g. Guy and Bayley 1995; Biber et al. 1999; Lehmann 2001; Hinrichs et al. 2015), in vernacular speech (e.g. Romaine 1982, Tottie and Harvie
2000; Tagliamonte 2002; Tagliamonte et al. 2005; Levey 2006), and from synchronic and diachronic perspectives (e.g. Romaine 1980; Ball 1996; Hundt et al. 2012; Nevalainen 2012, Nevalainen and Raumolin-Brunberg 2002). While stylistic variability of the different options in written present day English is well known (see Biber et al. 1999; Leech et al. 2009), we know little about the diachronic development of relativizers according to register, e.g. in scientific writing. Also, most research only considers most common relativizers (e.g. which, that, zero) still in use in present day English. Here, we study a more comprehensive set of relativizers across scientific and “general language” (mix of registers) from a diachronic perspective. Possible paradigmatic change is analyzed by diachronic word embeddings (cf. Fankhauser and Kupietz 2017), allowing us to select items affected by change. Then we assess the change (reduction/expansion) of a paradigm estimating its entropy over time. To check whether changes are specific to scientific language, we compare with uses in general language. Finally, we inspect possible changes in the predictability of selected wh-relativizers involved in paradigmatic change estimating their surprisal over time, looking for traces of conventionalization (cf. Degaetano-Ortlieb and Teich 2016, 2018).

@miscellaneous{Krielke2019b,
title = {System and use of wh-relativizers in 200 years of English scientific writing},
author = {Marie-Pauline Krielke and Stefan Fischer and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/cl2019_paper_266.pdf},
year = {2019},
date = {2019},
booktitle = {10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {We investigate the diachronic development of wh-relativizers in English scientific writing in the late modern period, characterized by an initially richly populated paradigm in the late 17th/early 18th century and a reduction to only a few options by the mid 19th century. To explain this reduction, we take the perspective of rational communication, according to which language users, while striving for successful communication, seek to reduce their effort. Previous work has shown that production effort is directly linked to the number of options at a given choice point (Milin et al. 2009, Linzen and Jaeger 2016). This effort is appropriately indexed by entropy: The more options with equal/similar probability, the higher the entropy, i.e. the higher the production effort. Similarly, processing effort is correlated with predictability in context – surprisal (Levy 2008). Highly predictable, conventionalized patterns are easier to produce and comprehend than less predictable ones. Assuming that language users strive for ease in communication, diachronically they are likely to (a) develop a preference for which options to use and discard others to reduce entropy, and (b) converge on how to use those options to reduce surprisal. We test this for the changing use of wh-relativizers in scientific text in the late modern period. Many scholars have investigated variation in relativizer choice in standard spoken and written varieties (e.g. Guy and Bayley 1995; Biber et al. 1999; Lehmann 2001; Hinrichs et al. 2015), in vernacular speech (e.g. Romaine 1982, Tottie and Harvie 2000; Tagliamonte 2002; Tagliamonte et al. 2005; Levey 2006), and from synchronic and diachronic perspectives (e.g. Romaine 1980; Ball 1996; Hundt et al. 2012; Nevalainen 2012, Nevalainen and Raumolin-Brunberg 2002). While stylistic variability of the different options in written present day English is well known (see Biber et al. 1999; Leech et al. 2009), we know little about the diachronic development of relativizers according to register, e.g. in scientific writing. Also, most research only considers most common relativizers (e.g. which, that, zero) still in use in present day English. Here, we study a more comprehensive set of relativizers across scientific and “general language” (mix of registers) from a diachronic perspective. Possible paradigmatic change is analyzed by diachronic word embeddings (cf. Fankhauser and Kupietz 2017), allowing us to select items affected by change. Then we assess the change (reduction/expansion) of a paradigm estimating its entropy over time. To check whether changes are specific to scientific language, we compare with uses in general language. Finally, we inspect possible changes in the predictability of selected wh-relativizers involved in paradigmatic change estimating their surprisal over time, looking for traces of conventionalization (cf. Degaetano-Ortlieb and Teich 2016, 2018).},
pubstate = {published},
type = {miscellaneous}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Krielke, Marie-Pauline; Scheurer, Franziska; Teich, Elke

A diachronic perspective on efficiency in language use: that-complement clause in academic writing across 300 years Inproceedings

Proceedings of the 10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

Efficiency in language use and the role of predictability in context have attracted many researchers from different fields (Zipf 1949; Landau 1969; Fidelholtz 1975, Jurafsky et al. 1998; Bybee and Scheibman 1999; Genzel and Charniak 2002; Aylett and Turk 2004; Hawkins 2004; Piantadosi et al. 2009, Jaeger 2010). The analysis of reduction processes, where linguistic units are reduced/omitted has enhanced our knowledge on efficiency in communication. Possible factors affecting retention or omission of an optional element include discourse context (cf. Thompson and Mulac 1991), the amount of information a unit transmits given its context (known as surprisal, cf. Jaeger 2010) or the complexity of the syntagmatic environment (Rohdenburg 1998). So far, the role change in language use plays has been less considered.

@inproceedings{Degaetano-Ortlieb2019b,
title = {A diachronic perspective on efficiency in language use: that-complement clause in academic writing across 300 years},
author = {Stefania Degaetano-Ortlieb and Marie-Pauline Krielke and Franziska Scheurer and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/abstract_that-comp_final.pdf},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {Efficiency in language use and the role of predictability in context have attracted many researchers from different fields (Zipf 1949; Landau 1969; Fidelholtz 1975, Jurafsky et al. 1998; Bybee and Scheibman 1999; Genzel and Charniak 2002; Aylett and Turk 2004; Hawkins 2004; Piantadosi et al. 2009, Jaeger 2010). The analysis of reduction processes, where linguistic units are reduced/omitted has enhanced our knowledge on efficiency in communication. Possible factors affecting retention or omission of an optional element include discourse context (cf. Thompson and Mulac 1991), the amount of information a unit transmits given its context (known as surprisal, cf. Jaeger 2010) or the complexity of the syntagmatic environment (Rohdenburg 1998). So far, the role change in language use plays has been less considered.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania

Hybridization effects in literary texts Inproceedings

Proceedings of the 10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

We present an analysis of subregisters, whose differentiation is still a difficult task due to their hybridity reflected in conforming to a presumed “norm” and encompassing something “new”. We focus on texts at the interface between what Halliday (2002: 177) calls two opposite “cultures”, literature and science (here: science fiction texts). Texts belonging to one register will exhibit similar choices of lexico-grammatical features. Hybrid texts at the intersection between two registers will reflect a mixture of particular features (cf. Degaetano-Ortlieb et al. 2014, Biber et al. 2015, Teich et al. 2013, 2016, Underwood 2016). Consider example (1) taken from Mary Shelley’s Frankenstein. While traditionally grounded as a literary text, it shows a registerial nuance from the influential register of science. This encompasses phrases (bold) also found in scientific articles from that period (e.g. in the Royal Society Corpus, cf. Kermes et al. 2016), verbs related to scientific endeavor (e.g. become acquainted, examine, observe, discover), and scientific terminology (e.g. anatomy, decay, corruption, vertebrae, inflammable air) packed into complex nominal phrases (underlined). Note that features marking this registerial nuance include not only lexical but also grammatical features.

(1) I became acquainted with the science of anatomy, but this was not sufficient; I must also observe the natural decay and corruption of the human body. […] Now I was led to examine the cause and progress of this decay. I succeeded in discovering the cause of generation and life. (Frankenstein, Mary Shelley, 1818/1823).

Thus, we hypothesize that hybrid registers while mainly resembling their traditional register in the use of lexico-grammatical features (H1 register resemblance), will also show particular lexico-grammatical nuances of their influential register (H2 registerial nuance). In particular, we are interested in (a) variation across registers to see which lexico-grammatical features are involved in hybridization effects and (b) intra-textual variation (e.g. across chapters) to analyze in which parts of a text hybridization effects are most prominent.

@inproceedings{Degaetano-Ortlieb2019b,
title = {Hybridization effects in literary texts},
author = {Stefania Degaetano-Ortlieb},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/abstact_cl2019_hybridization_final.pdf},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {We present an analysis of subregisters, whose differentiation is still a difficult task due to their hybridity reflected in conforming to a presumed “norm” and encompassing something “new”. We focus on texts at the interface between what Halliday (2002: 177) calls two opposite “cultures”, literature and science (here: science fiction texts). Texts belonging to one register will exhibit similar choices of lexico-grammatical features. Hybrid texts at the intersection between two registers will reflect a mixture of particular features (cf. Degaetano-Ortlieb et al. 2014, Biber et al. 2015, Teich et al. 2013, 2016, Underwood 2016). Consider example (1) taken from Mary Shelley’s Frankenstein. While traditionally grounded as a literary text, it shows a registerial nuance from the influential register of science. This encompasses phrases (bold) also found in scientific articles from that period (e.g. in the Royal Society Corpus, cf. Kermes et al. 2016), verbs related to scientific endeavor (e.g. become acquainted, examine, observe, discover), and scientific terminology (e.g. anatomy, decay, corruption, vertebrae, inflammable air) packed into complex nominal phrases (underlined). Note that features marking this registerial nuance include not only lexical but also grammatical features. (1) I became acquainted with the science of anatomy, but this was not sufficient; I must also observe the natural decay and corruption of the human body. […] Now I was led to examine the cause and progress of this decay. I succeeded in discovering the cause of generation and life. (Frankenstein, Mary Shelley, 1818/1823). Thus, we hypothesize that hybrid registers while mainly resembling their traditional register in the use of lexico-grammatical features (H1 register resemblance), will also show particular lexico-grammatical nuances of their influential register (H2 registerial nuance). In particular, we are interested in (a) variation across registers to see which lexico-grammatical features are involved in hybridization effects and (b) intra-textual variation (e.g. across chapters) to analyze in which parts of a text hybridization effects are most prominent.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Piper, Andrew

The Scientization of Literary Study Inproceedings

Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at NAACL 2019, Association for Computational Linguistics, pp. 18-28, Minneapolis, MN, USA, 2019.

Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).

@inproceedings{degaetano-ortlieb-piper-2019-scientization,
title = {The Scientization of Literary Study},
author = {Stefania Degaetano-Ortlieb and Andrew Piper},
url = {https://aclanthology.org/W19-2503},
doi = {https://doi.org/10.18653/v1/W19-2503},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at NAACL 2019},
pages = {18-28},
publisher = {Association for Computational Linguistics},
address = {Minneapolis, MN, USA},
abstract = {Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Toward an optimal code for communication: the case of scientific English Journal Article

Corpus Linguistics and Linguistic Theory, 18, pp. 1-33, 2019.

We present a model of the linguistic development of scientific English from the mid-seventeenth to the late-nineteenth century, a period that witnessed significant political and social changes, including the evolution of modern science. There is a wealth of descriptive accounts of scientific English, both from a synchronic and a diachronic perspective, but only few attempts at a unified explanation of its evolution. The explanation we offer here is a communicative one: while external pressures (specialization, diversification) push for an increase in expressivity, communicative concerns pull toward convergence on particular options (conventionalization). What emerges over time is a code which is optimized for written, specialist communication, relying on specific linguistic means to modulate information content. As we show, this is achieved by the systematic interplay between lexis and grammar. The corpora we employ are the Royal Society Corpus (RSC) and for comparative purposes, the Corpus of Late Modern English (CLMET). We build various diachronic, computational n-gram language models of these corpora and then apply formal measures of information content (here: relative entropy and surprisal) to detect the linguistic features significantly contributing to diachronic change, estimate the (changing) level of information of features and capture the time course of change.

 

@article{Degaetano-Ortlieb2019b,
title = {Toward an optimal code for communication: the case of scientific English},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://www.degruyter.com/document/doi/10.1515/cllt-2018-0088/html?lang=en},
doi = {https://doi.org/10.1515/cllt-2018-0088},
year = {2019},
date = {2019},
journal = {Corpus Linguistics and Linguistic Theory},
pages = {1-33},
volume = {18},
number = {1},
abstract = {We present a model of the linguistic development of scientific English from the mid-seventeenth to the late-nineteenth century, a period that witnessed significant political and social changes, including the evolution of modern science. There is a wealth of descriptive accounts of scientific English, both from a synchronic and a diachronic perspective, but only few attempts at a unified explanation of its evolution. The explanation we offer here is a communicative one: while external pressures (specialization, diversification) push for an increase in expressivity, communicative concerns pull toward convergence on particular options (conventionalization). What emerges over time is a code which is optimized for written, specialist communication, relying on specific linguistic means to modulate information content. As we show, this is achieved by the systematic interplay between lexis and grammar. The corpora we employ are the Royal Society Corpus (RSC) and for comparative purposes, the Corpus of Late Modern English (CLMET). We build various diachronic, computational n-gram language models of these corpora and then apply formal measures of information content (here: relative entropy and surprisal) to detect the linguistic features significantly contributing to diachronic change, estimate the (changing) level of information of features and capture the time course of change.},
pubstate = {published},
type = {article}
}

Copy BibTeX to Clipboard

Project:   B1

Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

Paradigmatic change and redistribution of functional load: The case of relative clauses in scientific English Miscellaneous

Symposium on Corpus Approaches to Lexicogrammar (Book of Abstracts), Edge Hill University, 2019.

@miscellaneous{Krielke2019,
title = {Paradigmatic change and redistribution of functional load: The case of relative clauses in scientific English},
author = {Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
year = {2019},
date = {2019},
booktitle = {Symposium on Corpus Approaches to Lexicogrammar (Book of Abstracts)},
address = {Edge Hill University},
pubstate = {published},
type = {miscellaneous}
}

Copy BibTeX to Clipboard

Project:   B1

Menzel, Katrin; Teich, Elke

Medical discourse across 300 years: insights from the Royal Society Corpus Inproceedings

2nd International Conference on Historical Medical Discourse (CHIMED-2), 2019.

@inproceedings{Menzel2019b,
title = {Medical discourse across 300 years: insights from the Royal Society Corpus},
author = {Katrin Menzel and Elke Teich},
year = {2019},
date = {2019},
booktitle = {2nd International Conference on Historical Medical Discourse (CHIMED-2)},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke; Khamis, Ashraf; Kermes, Hannah

An Information-Theoretic Approach to Modeling Diachronic Change in Scientific English Book Chapter

Suhr, Carla; Nevalainen, Terttu; Taavitsainen, Irma (Ed.): From Data to Evidence in English Language Research, Brill, pp. 258-281, Leiden, 2019.

We present an information-theoretic approach to investigate diachronic change in scientific English. Our main assumption is that over time scientific English has become increasingly dense, i.e. linguistic constructions allowing dense packing of information are progressively used. So far, diachronic change in scientific writing has been investigated by means of frequency-based approaches (see e.g. Halliday (1988); Atkinson (1998); Biber (2006b, c); Biber and Gray (2016); Banks (2008); Taavitsainen and Pahta (2010)). We use information-theoretic measures (entropy, surprisal; Shannon (1949)) to assess features previously stated to change over time and to discover new, latent features from the data itself that are involved in diachronic change. For this, we use the Royal Society Corpus (rsc) (Kermes et al. (2016)), which spans over the time period 1665 to 1869. We present three kinds of analyses: nominal compounding (typical of academic writing), modal verbs (shown to have changed in frequency over time), and an analysis based on part-of-speech trigrams to detect new features that change diachronically. We show how information-theoretic measures help to investigate, evaluate and detect features involved in diachronic change.

@inbook{Degaetano-Ortlieb2019,
title = {An Information-Theoretic Approach to Modeling Diachronic Change in Scientific English},
author = {Stefania Degaetano-Ortlieb and Elke Teich and Ashraf Khamis and Hannah Kermes},
editor = {Carla Suhr and Terttu Nevalainen and Irma Taavitsainen},
url = {https://brill.com/display/book/edcoll/9789004390652/BP000014.xml},
doi = {https://doi.org/10.1163/9789004390652},
year = {2019},
date = {2019},
booktitle = {From Data to Evidence in English Language Research},
pages = {258-281},
publisher = {Brill},
address = {Leiden},
abstract = {We present an information-theoretic approach to investigate diachronic change in scientific English. Our main assumption is that over time scientific English has become increasingly dense, i.e. linguistic constructions allowing dense packing of information are progressively used. So far, diachronic change in scientific writing has been investigated by means of frequency-based approaches (see e.g. Halliday (1988); Atkinson (1998); Biber (2006b, c); Biber and Gray (2016); Banks (2008); Taavitsainen and Pahta (2010)). We use information-theoretic measures (entropy, surprisal; Shannon (1949)) to assess features previously stated to change over time and to discover new, latent features from the data itself that are involved in diachronic change. For this, we use the Royal Society Corpus (rsc) (Kermes et al. (2016)), which spans over the time period 1665 to 1869. We present three kinds of analyses: nominal compounding (typical of academic writing), modal verbs (shown to have changed in frequency over time), and an analysis based on part-of-speech trigrams to detect new features that change diachronically. We show how information-theoretic measures help to investigate, evaluate and detect features involved in diachronic change.},
pubstate = {published},
type = {inbook}
}

Copy BibTeX to Clipboard

Project:   B1

Bizzoni, Yuri; Mosbach, Marius; Klakow, Dietrich; Degaetano-Ortlieb, Stefania

Some steps towards the generation of diachronic WordNets Inproceedings

Proceedings of the 22nd Nordic Conference on Computational Linguistics, Linköping University Electronic Press, Turku, Finland, 2019.

We apply hyperbolic embeddings to trace the dynamics of change of conceptual-semantic relationships in a large diachronic scientific corpus (200 years). Our focus is on emerging scientific fields and the increasingly specialized terminology establishing around them. Reproducing high-quality hierarchical structures such as WordNet on a diachronic scale is a very difficult task.

Hyperbolic embeddings can map partial graphs into low dimensional, continuous hierarchical spaces, making more explicit the latent structure of the input. We show that starting from simple lists of word pairs (rather than a list of entities with directional links) it is possible to build diachronic hierarchical semantic spaces which allow us to model a process towards specialization for selected scientific fields.

@inproceedings{bizzoni-etal-2019-steps,
title = {Some steps towards the generation of diachronic WordNets},
author = {Yuri Bizzoni and Marius Mosbach and Dietrich Klakow and Stefania Degaetano-Ortlieb},
url = {https://www.aclweb.org/anthology/W19-6106},
year = {2019},
date = {2019-10-02},
booktitle = {Proceedings of the 22nd Nordic Conference on Computational Linguistics},
publisher = {Link{\"o}ping University Electronic Press},
address = {Turku, Finland},
abstract = {We apply hyperbolic embeddings to trace the dynamics of change of conceptual-semantic relationships in a large diachronic scientific corpus (200 years). Our focus is on emerging scientific fields and the increasingly specialized terminology establishing around them. Reproducing high-quality hierarchical structures such as WordNet on a diachronic scale is a very difficult task. Hyperbolic embeddings can map partial graphs into low dimensional, continuous hierarchical spaces, making more explicit the latent structure of the input. We show that starting from simple lists of word pairs (rather than a list of entities with directional links) it is possible to build diachronic hierarchical semantic spaces which allow us to model a process towards specialization for selected scientific fields.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Bizzoni, Yuri; Degaetano-Ortlieb, Stefania; Menzel, Katrin; Krielke, Marie-Pauline; Teich, Elke

Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings Inproceedings

Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, Association for Computational Linguistics, pp. 175-185, Florence, Italy, 2019.

The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.

@inproceedings{Bizzoni2019,
title = {Grammar and Meaning: Analysing the Topology of Diachronic Word Embeddings},
author = {Yuri Bizzoni and Stefania Degaetano-Ortlieb and Katrin Menzel and Marie-Pauline Krielke and Elke Teich},
url = {https://aclanthology.org/W19-4722},
doi = {https://doi.org/10.18653/v1/W19-4722},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change},
pages = {175-185},
publisher = {Association for Computational Linguistics},
address = {Florence, Italy},
abstract = {The paper showcases the application of word embeddings to change in language use in the domain of science, focusing on the Late Modern English period (17-19th century). Historically, this is the period in which many registers of English developed, including the language of science. Our overarching interest is the linguistic development of scientific writing to a distinctive (group of) register(s). A register is marked not only by the choice of lexical words (discourse domain) but crucially by grammatical choices which indicate style. The focus of the paper is on the latter, tracing words with primarily grammatical functions (function words and some selected, poly-functional word forms) diachronically. To this end, we combine diachronic word embeddings with appropriate visualization and exploratory techniques such as clustering and relative entropy for meaningful aggregation of data and diachronic comparison.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Menzel, Katrin

Daltonian atoms, Steiner's curve and Voltaic sparks - the role of eponymous terms in a diachronic corpus of English scientific writing Inproceedings

41. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS) , Bremen, Germany, 2019.

This poster has a focus on eponymous academic and scientific terms in the first 200 years of the Royal Society Corpus (RSC, ca. 9,800 English scientific journal articles from the Royal Society of London, 1665-1869, cf. Kermes et al. 2016). It is annotated at different linguistic levels and provides a number of query and visualization options. Various types of metadata are encoded for each text, e.g. text topics / academic disciplines. This dataset contains a variety of eponymous terms named after English, foreign and classical scholars and inventors. The poster presents the results of a corpus study on eponymous terms with common structural features such as multiword terms with similar part of speech patterns (e.g. adjective + noun constructions such as Newtonian telescope) and terms with shared morphological elements, e.g. those that contain possessive markers (e.g. Steiner’s curve) or identical derivational affixes (e.g. Bezoutic, Hippocratic). Queries have been developed to automatically retrieve these terms from the corpus and the results were manually filtered afterwards. There are, for instance, around 3,000 eponymous adjective + noun constructions derived from ca. 160 different names of scholars. Some are used as titles for institutions or academic events, positions and honours (e.g. Plumian Professor, Jacksonian prize) while most refer to scientific concepts and discoveries (e.g. Daltonian atoms, Voltaic sparks). The terms show specific distribution patterns within and across documents. It can be observed how such terms have developed when English became established as a language of science and scholarship and what role they played throughout the following centuries. The analysis of these terms also contributes to reconstructing cultural aspects and language contacts in various scientific fields and time periods. Additionally, the results can be used to complement English lexicographical resources for specialized languages (cf. also Menzel 2018) and they contribute to a growing understanding of diachronic and cross-linguistic aspects of term formation processes.

@inproceedings{Menzel2019,
title = {Daltonian atoms, Steiner's curve and Voltaic sparks - the role of eponymous terms in a diachronic corpus of English scientific writing},
author = {Katrin Menzel},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/poster/Menzel.pdf},
year = {2019},
date = {2019-03-06},
publisher = {41. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS)},
address = {Bremen, Germany},
abstract = {This poster has a focus on eponymous academic and scientific terms in the first 200 years of the Royal Society Corpus (RSC, ca. 9,800 English scientific journal articles from the Royal Society of London, 1665-1869, cf. Kermes et al. 2016). It is annotated at different linguistic levels and provides a number of query and visualization options. Various types of metadata are encoded for each text, e.g. text topics / academic disciplines. This dataset contains a variety of eponymous terms named after English, foreign and classical scholars and inventors. The poster presents the results of a corpus study on eponymous terms with common structural features such as multiword terms with similar part of speech patterns (e.g. adjective + noun constructions such as Newtonian telescope) and terms with shared morphological elements, e.g. those that contain possessive markers (e.g. Steiner’s curve) or identical derivational affixes (e.g. Bezoutic, Hippocratic). Queries have been developed to automatically retrieve these terms from the corpus and the results were manually filtered afterwards. There are, for instance, around 3,000 eponymous adjective + noun constructions derived from ca. 160 different names of scholars. Some are used as titles for institutions or academic events, positions and honours (e.g. Plumian Professor, Jacksonian prize) while most refer to scientific concepts and discoveries (e.g. Daltonian atoms, Voltaic sparks). The terms show specific distribution patterns within and across documents. It can be observed how such terms have developed when English became established as a language of science and scholarship and what role they played throughout the following centuries. The analysis of these terms also contributes to reconstructing cultural aspects and language contacts in various scientific fields and time periods. Additionally, the results can be used to complement English lexicographical resources for specialized languages (cf. also Menzel 2018) and they contribute to a growing understanding of diachronic and cross-linguistic aspects of term formation processes.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Fischer, Stefan; Teich, Elke

More complex or just more diverse? Capturing diachronic linguistic variation Inproceedings

41. Jahrestagung der Deutschen Gesellschaft für Sprachwissenschaft (DGfS), Bremen, Germany, 2019.

We present a diachronic comparison of general (register-mixed) and scientific English in the late modern period (1700–1900). For our analysis we use two corpora which are comparable in size and time-span: the Corpus of Late Modern English (CLMET; De Smet et al. 2015) and the Royal Society Corpus (RSC; Kermes et al. 2016). Previous studies of scientific English found a diachronic tendency from a verbal, involved to a more nominal, abstract style compared to other discourse types (cf. Halliday 1988; Biber & Gray 2011). The features reported include type-token ratio, lexical density, number of words per sentence and relative frequency of nominal vs. verbal categories—all potential indicators of linguistic complexity at a shallow level. We present results for these common measures on our data set as well as for selected information-theoretic measures, notably relative entropy (Kullback–Leibler divergence: KLD) and surprisal. For instance, using KLD, we observe a continuous divergence between general and scientific language based on word unigrams as well as part-of-speech trigrams. Lexical density increases over time for both scientific language and general language. In both corpora, sentence length decreases by roughly 25%, with scientific sentences being longer on average. On the other hand, mean sentence surprisal remains stable over time. The poster will give an overview of our results using the selected measures and discuss possible interpretations. Moreover, we will assess their utility for capturing linguistic diversification, showing that the information-theoretic measures are fairly fine-tuned, robust and link up well to explanations in terms of linguistic complexity and rational communication (cf. Hale 2016; Crocker, Demberg, & Teich 2016).

@inproceedings{Fischer2019,
title = {More complex or just more diverse? Capturing diachronic linguistic variation},
author = {Stefan Fischer and Elke Teich},
url = {http://www.dgfs2019.uni-bremen.de/abstracts/poster/Fischer_Teich.pdf},
year = {2019},
date = {2019-03-06},
publisher = {41. Jahrestagung der Deutschen Gesellschaft f{\"u}r Sprachwissenschaft (DGfS)},
address = {Bremen, Germany},
abstract = {We present a diachronic comparison of general (register-mixed) and scientific English in the late modern period (1700–1900). For our analysis we use two corpora which are comparable in size and time-span: the Corpus of Late Modern English (CLMET; De Smet et al. 2015) and the Royal Society Corpus (RSC; Kermes et al. 2016). Previous studies of scientific English found a diachronic tendency from a verbal, involved to a more nominal, abstract style compared to other discourse types (cf. Halliday 1988; Biber & Gray 2011). The features reported include type-token ratio, lexical density, number of words per sentence and relative frequency of nominal vs. verbal categories—all potential indicators of linguistic complexity at a shallow level. We present results for these common measures on our data set as well as for selected information-theoretic measures, notably relative entropy (Kullback–Leibler divergence: KLD) and surprisal. For instance, using KLD, we observe a continuous divergence between general and scientific language based on word unigrams as well as part-of-speech trigrams. Lexical density increases over time for both scientific language and general language. In both corpora, sentence length decreases by roughly 25%, with scientific sentences being longer on average. On the other hand, mean sentence surprisal remains stable over time. The poster will give an overview of our results using the selected measures and discuss possible interpretations. Moreover, we will assess their utility for capturing linguistic diversification, showing that the information-theoretic measures are fairly fine-tuned, robust and link up well to explanations in terms of linguistic complexity and rational communication (cf. Hale 2016; Crocker, Demberg, & Teich 2016).},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Using relative entropy for detection and analysis of periods of diachronic linguistic change Inproceedings

Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at COLING2018, Association for Computational Linguistics , pp. 22-33, Santa Fe, New Mexico, 2018.

We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler Divergence) comparing temporally adjacent periods and sliding over the time line from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges.

@inproceedings{Degaetano-Ortlieb2018b,
title = {Using relative entropy for detection and analysis of periods of diachronic linguistic change},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W18-4503},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at COLING2018},
pages = {22-33},
publisher = {Association for Computational Linguistics},
address = {Santa Fe, New Mexico},
abstract = {We present a data-driven approach to detect periods of linguistic change and the lexical and grammatical features contributing to change. We focus on the development of scientific English in the late modern period. Our approach is based on relative entropy (Kullback-Leibler Divergence) comparing temporally adjacent periods and sliding over the time line from past to present. Using a diachronic corpus of scientific publications of the Royal Society of London, we show how periods of change reflect the interplay between lexis and grammar, where periods of lexical expansion are typically followed by periods of grammatical consolidation resulting in a balance between expressivity and communicative efficiency. Our method is generic and can be applied to other data sets, languages and time ranges.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Teich, Elke; Fankhauser, Peter

Aspects of Linguistic and Computational Modeling in Language Science Book Chapter

Flanders, Julia; Jannidis, Fotis (Ed.): The Shape of Data in Digital Humanities. Modeling Texts and Text-based Resources. (Digital Research in the Arts and Humanities). , Routledge, Taylor & Francis, pp. 236-249, New York, 2018.

Linguistics is concerned with modeling language from the cognitive, social, and historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in time in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantidosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions such as: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained? This new empiricism is characterized by an interest in language use in context accompanied by a commitment to computational modeling, which is probably most developed in psycholinguistics, giving rise to the field of “computational psycholinguistics” (cf. Crocker, 2010), but recently getting stronger also in corpus linguistics.

@inbook{Teich2018,
title = {Aspects of Linguistic and Computational Modeling in Language Science},
author = {Elke Teich and Peter Fankhauser},
editor = {Julia Flanders and Fotis Jannidis},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/34320},
year = {2018},
date = {2018},
booktitle = {The Shape of Data in Digital Humanities. Modeling Texts and Text-based Resources. (Digital Research in the Arts and Humanities).},
pages = {236-249},
publisher = {Routledge, Taylor & Francis},
address = {New York},
abstract = {Linguistics is concerned with modeling language from the cognitive, social, and historical perspectives. When practiced as a science, linguistics is characterized by the tension between the two methodological dispositions of rationalism and empiricism. At any point in time in the history of linguistics, one is more dominant than the other. In the last two decades, we have been experiencing a new wave of empiricism in linguistic fields as diverse as psycholinguistics (e.g., Chater et al., 2015), language typology (e.g., Piantidosi and Gibson, 2014), language change (e.g., Bybee, 2010) and language variation (e.g., Bresnan and Ford, 2010). Consequently, the practices of modeling are being renegotiated in different linguistic communities, readdressing some fundamental methodological questions such as: How to cast a research question into an appropriate study design? How to obtain evidence (data) for a hypothesis (e.g., experiment vs. corpus)? How to process the data? How to evaluate a hypothesis in the light of the data obtained? This new empiricism is characterized by an interest in language use in context accompanied by a commitment to computational modeling, which is probably most developed in psycholinguistics, giving rise to the field of “computational psycholinguistics” (cf. Crocker, 2010), but recently getting stronger also in corpus linguistics.},
pubstate = {published},
type = {inbook}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Strötgen, Jannik

Diachronic variation of temporal expressions in scientific writing through the lens of relative entropy Inproceedings

Rehm, Georg; Declerck, Thierry (Ed.): Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, September 13-14, Proceedings. Lecture Notes in Computer Science, 10713, Springer International Publishing, pp. 250-275, Berlin, Germany, 2018.

The abundance of temporal information in documents has lead to an increased interest in processing such information in the NLP community by considering temporal expressions. Besides domain-adaptation, acquiring knowledge on variation of temporal expressions according to time is relevant for improvement in automatic processing. So far, frequency-based accounts dominate in the investigation of specific temporal expressions. We present an approach to investigate diachronic changes of temporal expressions based on relative entropy – with the advantage of using conditioned probabilities rather than mere frequency. While we focus on scientific writing, our approach is generalizable to other domains and interesting not only in the field of NLP, but also in humanities.

@inproceedings{Degaetano-Ortlieb2018b,
title = {Diachronic variation of temporal expressions in scientific writing through the lens of relative entropy},
author = {Stefania Degaetano-Ortlieb and Jannik Str{\"o}tgen},
editor = {Georg Rehm and Thierry Declerck},
url = {https://link.springer.com/chapter/10.1007/978-3-319-73706-5_22},
year = {2018},
date = {2018},
booktitle = {Language Technologies for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, September 13-14, Proceedings. Lecture Notes in Computer Science},
pages = {250-275},
publisher = {Springer International Publishing},
address = {Berlin, Germany},
abstract = {The abundance of temporal information in documents has lead to an increased interest in processing such information in the NLP community by considering temporal expressions. Besides domain-adaptation, acquiring knowledge on variation of temporal expressions according to time is relevant for improvement in automatic processing. So far, frequency-based accounts dominate in the investigation of specific temporal expressions. We present an approach to investigate diachronic changes of temporal expressions based on relative entropy – with the advantage of using conditioned probabilities rather than mere frequency. While we focus on scientific writing, our approach is generalizable to other domains and interesting not only in the field of NLP, but also in humanities.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Successfully