Publications

Degaetano-Ortlieb, Stefania; Teich, Elke

Information-based modeling of diachronic linguistic change: from typicality to productivity Inproceedings

Proceedings of Language Technologies for the Socio-Economic Sciences and Humanities (LATECH'16), Association for Computational Linguistics (ACL), Association for Computational Linguistics, pp. 165-173, Berlin, Germany, 2016.

We present a new approach for modeling diachronic linguistic change in grammatical usage. We illustrate the approach on English scientific writing in Late Modern English, focusing on grammatical patterns that are potentially indicative of shifts in register, genre and/or style. Commonly, diachronic change is characterized by the relative frequency of typical linguistic features over time. However, to fully capture changing linguistic usage, feature productivity needs to be taken into account as well. We introduce a data-driven approach for systematically detecting typical features and assessing their productivity over time, using information-theoretic measures of entropy and surprisal.
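The entropy and surprisal measures named in the abstract can be illustrated with a toy sketch (this is a generic illustration of the standard definitions, not the authors' implementation; the suffix frequencies below are invented for the example):

```python
import math
from collections import Counter

def surprisal(counts, item, context_total):
    """Surprisal of an item in a context: -log2 p(item | context), in bits."""
    return -math.log2(counts[item] / context_total)

def entropy(counts):
    """Shannon entropy of a frequency distribution, in bits."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Toy example: frequencies of (hypothetical) derivational suffix types
# in one time period; higher entropy suggests more varied, productive usage.
suffixes = Counter({"-ity": 50, "-ness": 30, "-tion": 20})
print(entropy(suffixes))
print(surprisal(suffixes, "-ness", sum(suffixes.values())))
```

A typical feature would show low surprisal (it is expected in its context), while a productive feature class would show rising entropy over time as new types enter the distribution.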

@inproceedings{Degaetano-Ortlieb2016a,
title = {Information-based modeling of diachronic linguistic change: from typicality to productivity},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W16-2121},
doi = {10.18653/v1/W16-2121},
year = {2016},
date = {2016},
booktitle = {Proceedings of Language Technologies for the Socio-Economic Sciences and Humanities (LATECH'16), Association for Computational Linguistics (ACL)},
pages = {165--173},
publisher = {Association for Computational Linguistics},
address = {Berlin, Germany},
abstract = {We present a new approach for modeling diachronic linguistic change in grammatical usage. We illustrate the approach on English scientific writing in Late Modern English, focusing on grammatical patterns that are potentially indicative of shifts in register, genre and/or style. Commonly, diachronic change is characterized by the relative frequency of typical linguistic features over time. However, to fully capture changing linguistic usage, feature productivity needs to be taken into account as well. We introduce a data-driven approach for systematically detecting typical features and assessing their productivity over time, using information-theoretic measures of entropy and surprisal.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Ordan, Noam; Teich, Elke

The taming of the data: Using text mining in building a corpus for diachronic analysis Inproceedings

Varieng - From Data to Evidence (d2e), University of Helsinki, 2015.

Social and historical linguistic studies benefit from corpora encoding contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora encoding variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).

@inproceedings{Degaetano-etal2015,
title = {The taming of the data: Using text mining in building a corpus for diachronic analysis},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and Noam Ordan and Elke Teich},
url = {https://www.ashrafkhamis.com/d2e2015.pdf},
year = {2015},
date = {2015-10-01},
booktitle = {Varieng - From Data to Evidence (d2e)},
address = {University of Helsinki},
abstract = {Social and historical linguistic studies benefit from corpora encoding contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora encoding variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Khamis, Ashraf; Degaetano-Ortlieb, Stefania; Kermes, Hannah; Knappen, Jörg; Ordan, Noam; Teich, Elke

A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus Inproceedings

Corpus Linguistics 2015, Lancaster, 2015.

There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.

@inproceedings{Khamis-etal2015,
title = {A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus},
author = {Ashraf Khamis and Stefania Degaetano-Ortlieb and Hannah Kermes and J{\"o}rg Knappen and Noam Ordan and Elke Teich},
url = {https://www.researchgate.net/publication/331648570_A_resource_for_the_diachronic_study_of_scientific_English_Introducing_the_Royal_Society_Corpus},
year = {2015},
date = {2015-07-01},
booktitle = {Corpus Linguistics 2015},
address = {Lancaster},
abstract = {There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Knappen, Jörg; Teich, Elke

Information Density in Scientific Writing: A Diachronic Perspective Inproceedings

"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015), RWTH Aachen University, 2015.

We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).
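The measurement described under (b), the probability of an item conditioned by its context, can be sketched with a minimal bigram language model (a generic illustration of the standard technique, not the project's actual model; the sentence and add-alpha smoothing are assumptions for the example):

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Collect bigram counts and preceding-word context counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    return bigrams, contexts

def avg_surprisal(tokens, bigrams, contexts, vocab_size, alpha=1.0):
    """Mean per-token surprisal -log2 p(w_i | w_{i-1}), add-alpha smoothed.
    Higher values indicate denser, less predictable text."""
    total, n = 0.0, 0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * vocab_size)
        total += -math.log2(p)
        n += 1
    return total / n

# Toy corpus slice; in practice one model per time period would be trained
# and average surprisal compared across periods.
tokens = "the electric fluid passed through the wire".split()
bigrams, contexts = train_bigram(tokens)
print(avg_surprisal(tokens, bigrams, contexts, len(set(tokens))))
```

Comparing such averages across time slices is one simple way to operationalize "relative information density of texts along a time line".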

@inproceedings{Degaetano-etal2015b,
title = {Information Density in Scientific Writing: A Diachronic Perspective},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and J{\"o}rg Knappen and Elke Teich},
url = {https://www.researchgate.net/publication/331648534_Information_Density_in_Scientific_Writing_A_Diachronic_Perspective},
year = {2015},
date = {2015-07-01},
booktitle = {"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015)},
address = {RWTH Aachen University},
abstract = {We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).},
pubstate = {published},
type = {inproceedings}
}


Project:   B1

Crocker, Matthew W.; Demberg, Vera; Teich, Elke

Information Density and Linguistic Encoding (IDeaL) Journal Article

KI - Künstliche Intelligenz, 30 (1), pp. 77-81, 2015.

We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.

@article{crocker:demberg:teich,
title = {Information Density and Linguistic Encoding (IDeaL)},
author = {Matthew W. Crocker and Vera Demberg and Elke Teich},
url = {http://link.springer.com/article/10.1007/s13218-015-0391-y/fulltext.html},
doi = {10.1007/s13218-015-0391-y},
year = {2015},
date = {2015},
journal = {KI - K{\"u}nstliche Intelligenz},
pages = {77--81},
volume = {30},
number = {1},
abstract = {We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.},
pubstate = {published},
type = {article}
}


Projects:   A1 A3 B1
