Publications

Degaetano-Ortlieb, Stefania

Stylistic Variation over 200 Years of Court Proceedings According to Gender and Social Class Inproceedings

Proceedings of the 2nd Workshop on Stylistic Variation, co-located with NAACL-HLT 2018, Association for Computational Linguistics, pp. 1-10, New Orleans, 2018.

We present an approach to detect stylistic variation across social variables (here: gender and social class), considering also diachronic change in language use. For detection of stylistic variation, we use relative entropy, measuring the difference between probability distributions at different linguistic levels (here: lexis and grammar). In addition, by relative entropy, we can determine which linguistic units are related to stylistic variation.
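
For illustration, here is a minimal sketch of the kind of relative-entropy (Kullback-Leibler divergence) comparison described in the abstract, computed over word frequencies of two toy samples; the token lists, the smoothing constant and the choice of words as units are placeholder assumptions, not the paper's actual setup:

```python
from collections import Counter
import math

def kld_per_word(tokens_a, tokens_b, alpha=0.01):
    """Relative entropy D(A||B) over word distributions, with additive smoothing.
    Returns the total KLD (in bits) and each word's contribution, so the units
    most strongly associated with variety A can be ranked."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values()) + alpha * len(vocab)
    n_b = sum(counts_b.values()) + alpha * len(vocab)
    contributions = {}
    for w in vocab:
        p = (counts_a[w] + alpha) / n_a
        q = (counts_b[w] + alpha) / n_b
        contributions[w] = p * math.log2(p / q)
    return sum(contributions.values()), contributions

# Toy samples standing in for, e.g., two social groups' speech.
sample_a = "my lord the prisoner took my purse and ran".split()
sample_b = "the prisoner was seen near the house by the constable".split()
total, contrib = kld_per_word(sample_a, sample_b)
print(round(total, 3))
print(sorted(contrib, key=contrib.get, reverse=True)[:5])  # most distinctive words
```

Ranking units by their contribution to the divergence is what makes it possible to say which words or grammatical patterns carry the stylistic difference.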

@inproceedings{Degaetano-Ortlieb2018,
title = {Stylistic Variation over 200 Years of Court Proceedings According to Gender and Social Class},
author = {Stefania Degaetano-Ortlieb},
url = {https://aclanthology.org/W18-1601},
doi = {https://doi.org/10.18653/v1/W18-1601},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 2nd Workshop on Stylistic Variation, co-located with NAACL-HLT 2018},
pages = {1-10},
publisher = {Association for Computational Linguistics},
address = {New Orleans},
abstract = {We present an approach to detect stylistic variation across social variables (here: gender and social class), considering also diachronic change in language use. For detection of stylistic variation, we use relative entropy, measuring the difference between probability distributions at different linguistic levels (here: lexis and grammar). In addition, by relative entropy, we can determine which linguistic units are related to stylistic variation.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Fischer, Stefan; Knappen, Jörg; Teich, Elke

Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English Inproceedings

Proceedings of DH 2018, Mexico City, Mexico, 2018.

In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence for their use on factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach for exploring the research fields of selected authors in a corpus of late modern scientific English by topic modelling, looking at the topics assigned to an author’s texts over the author’s lifetime. Areas of applications we target are history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), or diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016). We use the Royal Society Corpus (RSC, Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available at https://fedora.clarin-d.uni-saarland.de/rsc/. As we are interested in the development of individual authors, we focus on the single-author texts (81%) of the corpus. In total, 2,752 names are annotated in the single-author papers, but the activity of authors varies. Figure 1 shows that a small group of authors wrote a large portion of the texts. In fact, the twelve authors used for our analysis wrote 11% of the single-author articles.
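
As a rough illustration of the workflow sketched in the abstract (fit a topic model over an author's dated texts, then read off topic proportions per year), here is a minimal sketch using scikit-learn; the documents, topic count and preprocessing are invented placeholders, not the RSC pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder documents by one author, keyed by publication year.
docs = {
    1825: "electricity of the torpedo and the conducting power of metals",
    1833: "magnetic rotation and the electric current in voltaic apparatus",
    1845: "action of magnets upon light and the magnetisation of transparent bodies",
}

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs.values())

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # rows: documents, columns: topic proportions

# Topic proportions over the author's lifetime.
for year, dist in zip(docs, doc_topics):
    print(year, [round(p, 2) for p in dist])

# Top words per topic, to help label the topics.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
```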

@inproceedings{fischer-etal2018,
title = {Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English},
author = {Stefan Fischer and J{\"o}rg Knappen and Elke Teich},
url = {https://dh2018.adho.org/en/using-topic-modelling-to-explore-authors-research-fields-in-a-corpus-of-historical-scientific-english/},
year = {2018},
date = {2018},
booktitle = {Proceedings of DH 2018},
address = {Mexico City, Mexico},
abstract = {In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence for their use on factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach for exploring the research fields of selected authors in a corpus of late modern scientific English by topic modelling, looking at the topics assigned to an author’s texts over the author’s lifetime. Areas of applications we target are history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), or diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016). We use the Royal Society Corpus (RSC, Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available at https://fedora.clarin-d.uni-saarland.de/rsc/. As we are interested in the development of individual authors, we focus on the single-author texts (81%) of the corpus. In total, 2,752 names are annotated in the single-author papers, but the activity of authors varies. Figure 1 shows that a small group of authors wrote a large portion of the texts. In fact, the twelve authors used for our analysis wrote 11% of the single-author articles.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Menzel, Katrin

Using diachronic corpora of scientific journal articles for complementing English corpus-based dictionaries and lexicographical resources for specialized languages Inproceedings

Proceedings of EURALEX2018, Ljubljana University Press, Faculty of Arts, Ljubljana, Slovenia, 2018, ISBN 978-961-06-0097-8.

As technology and science permeate nearly all areas of life in modern times, there is a certain trend for standard dictionaries to bolster their technical and scientific vocabulary and to identify more components, for instance more combining forms, in technical terms and terminological phrases. In this paper it is argued that recently built diachronic corpora of scientific journal articles with robust linguistic and metadata-based features are important resources for complementing English corpus-based dictionaries and lexicographical resources for specialized languages. The Royal Society Corpus (RSC, ca. 9,800 digitized texts, 32 million tokens) in combination with the Scientific Text Corpus (SciTex, ca. 5,000 documents, 39 million tokens), as two recently created corpus resources, offer the possibility to provide a fuller picture of the development of specialized vocabulary and of the number of meanings that general and technical terms have accumulated during their history. They facilitate the systematic identification of lexemes with specific linguistic characteristics or from selected disciplines and fields, and allow us to gain a better understanding of the development of academic writing in English scientific periodicals across several centuries, from their beginnings to the present day.

@inproceedings{Menzel2017b,
title = {Using diachronic corpora of scientific journal articles for complementing English corpus-based dictionaries and lexicographical resources for specialized languages},
author = {Katrin Menzel},
url = {https://euralex.org/publications/using-diachronic-corpora-of-scientific-journal-articles-for-complementing-english-corpus-based-dictionaries-and-lexicographical-resources-for-specialized-languages/},
year = {2018},
date = {2018},
booktitle = {Proceedings of EURALEX2018},
isbn = {978-961-06-0097-8},
publisher = {Ljubljana University Press, Faculty of Arts},
address = {Ljubljana, Slovenia},
abstract = {As technology and science permeate nearly all areas of life in modern times, there is a certain trend for standard dictionaries to bolster their technical and scientific vocabulary and to identify more components, for instance more combining forms, in technical terms and terminological phrases. In this paper it is argued that recently built diachronic corpora of scientific journal articles with robust linguistic and metadata-based features are important resources for complementing English corpus-based dictionaries and lexicographical resources for specialized languages. The Royal Society Corpus (RSC, ca. 9,800 digitized texts, 32 million tokens) in combination with the Scientific Text Corpus (SciTex, ca. 5,000 documents, 39 million tokens), as two recently created corpus resources, offer the possibility to provide a fuller picture of the development of specialized vocabulary and of the number of meanings that general and technical terms have accumulated during their history. They facilitate the systematic identification of lexemes with specific linguistic characteristics or from selected disciplines and fields, and allow us to gain a better understanding of the development of academic writing in English scientific periodicals across several centuries, from their beginnings to the present day.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania

Variation in language use across social variables: a data-driven approach Inproceedings

Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER), Bari, Italy, 2017.

We present a data-driven approach to study language use over time according to social variables (henceforth SV), considering also interactions between different variables. Besides sociolinguistic studies on language variation according to SVs (e.g., Weinreich et al. 1968, Bernstein 1971, Eckert 1989, Milroy and Milroy 1985), recently computational approaches have gained prominence (see e.g., Eisenstein 2015, Danescu-Niculescu-Mizil et al. 2013, and Nguyen et al. 2017 for an overview), not least due to an increase in data availability based on social media and an increasing awareness of the importance of linguistic variation according to SVs in the NLP community.

@inproceedings{Degaetano-Ortlieb2017b,
title = {Variation in language use across social variables: a data-driven approach},
author = {Stefania Degaetano-Ortlieb},
url = {https://stefaniadegaetano.files.wordpress.com/2017/07/clavier2017_slingpro_accepted.pdf},
year = {2017},
date = {2017},
booktitle = {Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER)},
address = {Bari, Italy},
abstract = {We present a data-driven approach to study language use over time according to social variables (henceforth SV), considering also interactions between different variables. Besides sociolinguistic studies on language variation according to SVs (e.g., Weinreich et al. 1968, Bernstein 1971, Eckert 1989, Milroy and Milroy 1985), recently computational approaches have gained prominence (see e.g., Eisenstein 2015, Danescu-Niculescu-Mizil et al. 2013, and Nguyen et al. 2017 for an overview), not least due to an increase in data availability based on social media and an increasing awareness of the importance of linguistic variation according to SVs in the NLP community.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

The course of grammatical change in scientific writing: Interdependency between convention and productivity Inproceedings

Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER), Bari, Italy, 2017.

We present an empirical approach to analyze the course of usage change in scientific writing. A great amount of linguistic research has dealt with grammatical changes, showing their gradual course of change, which nearly always progresses stepwise (see e.g. Bybee et al. 1994, Hopper and Traugott 2003, Lee 2011, De Smet and Van de Velde 2013). Less well understood is under which conditions these changes occur. According to De Smet (2016), specific expressions increase in frequency in one grammatical context, adopting a more conventionalized use, which in turn makes them available in closely related grammatical contexts.

@inproceedings{Degaetano-Ortlieb2017c,
title = {The course of grammatical change in scientific writing: Interdependency between convention and productivity},
author = {Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2017/07/clavier2017-degaetano-etal_accepted_final.pdf},
year = {2017},
date = {2017},
booktitle = {Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER)},
address = {Bari, Italy},
abstract = {We present an empirical approach to analyze the course of usage change in scientific writing. A great amount of linguistic research has dealt with grammatical changes, showing their gradual course of change, which nearly always progresses stepwise (see e.g. Bybee et al. 1994, Hopper and Traugott 2003, Lee 2011, De Smet and Van de Velde 2013). Less well understood is under which conditions these changes occur. According to De Smet (2016), specific expressions increase in frequency in one grammatical context, adopting a more conventionalized use, which in turn makes them available in closely related grammatical contexts.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Menzel, Katrin; Degaetano-Ortlieb, Stefania

The diachronic development of combining forms in scientific writing Journal Article

Lege Artis. Language yesterday, today, tomorrow. The Journal of University of SS Cyril and Methodius in Trnava. Warsaw: De Gruyter Open, 2, pp. 185-249, 2017.
This paper addresses the diachronic development of combining forms in English scientific texts over approximately 350 years, from the early stages of the first scholarly journals that were published in English to contemporary English scientific publications. In this paper a critical discussion of the category of combining forms is presented and a case study is produced to examine the role of selected combining forms in two diachronic English corpora.

@article{Menzel2017,
title = {The diachronic development of combining forms in scientific writing},
author = {Katrin Menzel and Stefania Degaetano-Ortlieb},
url = {https://www.researchgate.net/publication/321776056_The_diachronic_development_of_combining_forms_in_scientific_writing},
year = {2017},
date = {2017},
journal = {Lege Artis. Language yesterday, today, tomorrow. The Journal of University of SS Cyril and Methodius in Trnava. Warsaw: De Gruyter Open},
pages = {185-249},
volume = {2},
number = {2},
abstract = {This paper addresses the diachronic development of combining forms in English scientific texts over approximately 350 years, from the early stages of the first scholarly journals that were published in English to contemporary English scientific publications. In this paper a critical discussion of the category of combining forms is presented and a case study is produced to examine the role of selected combining forms in two diachronic English corpora.},
pubstate = {published},
type = {article}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Fischer, Stefan; Demberg, Vera; Teich, Elke

An information-theoretic account on the diachronic development of discourse connectors in scientific writing Inproceedings

39th DGfS AG1, Saarbrücken, Germany, 2017.

@inproceedings{Degaetano-Ortlieb2017d,
title = {An information-theoretic account on the diachronic development of discourse connectors in scientific writing},
author = {Stefania Degaetano-Ortlieb and Stefan Fischer and Vera Demberg and Elke Teich},
year = {2017},
date = {2017},
booktitle = {39th DGfS AG1},
address = {Saarbr{\"u}cken, Germany},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Knappen, Jörg; Fischer, Stefan; Kermes, Hannah; Teich, Elke; Fankhauser, Peter

The making of the Royal Society Corpus Inproceedings

21st Nordic Conference on Computational Linguistics (NoDaLiDa), Workshop on Processing Historical Language, pp. 7-11, Gothenburg, Sweden, 2017.
The Royal Society Corpus is a corpus of Early and Late modern English built in an agile process covering publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) with a size of approximately 30 million words. In this paper we will provide details on two aspects of the building process namely the mining of patterns for OCR correction and the improvement and evaluation of part-of-speech tagging.
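
The pattern-based OCR correction mentioned in the abstract can be pictured as a list of substitution rules applied to the raw text; the rules below are illustrative long-s confusions ("fome" for "some") written by hand, not the patterns actually mined for the corpus:

```python
import re

# Illustrative OCR substitution patterns of the long-s type;
# in the paper such patterns are mined from the corpus rather than hand-written.
PATTERNS = [
    (re.compile(r"\bfome\b"), "some"),
    (re.compile(r"\bfuch\b"), "such"),
    (re.compile(r"\bfaid\b"), "said"),
]

def correct(text: str) -> str:
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(correct("the faid experiment was made with fuch care"))
# -> "the said experiment was made with such care"
```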

@inproceedings{Knappen2017,
title = {The making of the Royal Society Corpus},
author = {J{\"o}rg Knappen and Stefan Fischer and Hannah Kermes and Elke Teich and Peter Fankhauser},
url = {https://www.researchgate.net/publication/331648134_The_Making_of_the_Royal_Society_Corpus},
year = {2017},
date = {2017},
booktitle = {21st Nordic Conference on Computational Linguistics (NoDaLiDa) Workshop on Processing Historical Language},
pages = {7-11},
publisher = {Workshop on Processing Historical Language},
address = {Gothenburg, Sweden},
abstract = {The Royal Society Corpus is a corpus of Early and Late modern English built in an agile process covering publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) with a size of approximately 30 million words. In this paper we will provide details on two aspects of the building process namely the mining of patterns for OCR correction and the improvement and evaluation of part-of-speech tagging.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kermes, Hannah; Teich, Elke

Average surprisal of parts-of-speech Inproceedings

Corpus Linguistics 2017, Birmingham, UK, 2017.

We present an approach to investigate the differences between lexical words and function words and the respective parts-of-speech from an information-theoretical point of view (cf. Shannon, 1949). We use average surprisal (AvS) to measure the amount of information transmitted by a linguistic unit. We expect to find function words to be more predictable (having a lower AvS) and lexical words to be less predictable (having a higher AvS). We also assume that function words’ AvS is fairly constant over time and registers, while AvS of lexical words is more variable depending on time and register.
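
A minimal sketch of how average surprisal per part of speech can be computed: estimate a language model, take the surprisal of each token given its context, and average by POS class. The toy tagged sequence and the add-one-smoothed bigram model are assumptions for illustration, not the authors' language model:

```python
from collections import Counter, defaultdict
import math

# Toy POS-tagged sequence of (word, tag) pairs; a real study uses a tagged corpus.
tagged = [("the", "DET"), ("acid", "NOUN"), ("dissolves", "VERB"),
          ("the", "DET"), ("metal", "NOUN"), ("in", "ADP"),
          ("the", "DET"), ("solution", "NOUN")]

words = [w for w, _ in tagged]
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
vocab_size = len(unigrams)

def surprisal(prev, word):
    """-log2 P(word | prev) under an add-one-smoothed bigram model."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

# Average surprisal per part of speech (skipping the sequence-initial token).
by_tag = defaultdict(list)
for (prev, _), (word, tag) in zip(tagged, tagged[1:]):
    by_tag[tag].append(surprisal(prev, word))

for tag, values in by_tag.items():
    print(tag, round(sum(values) / len(values), 2))
```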

@inproceedings{Kermes2017,
title = {Average surprisal of parts-of-speech},
author = {Hannah Kermes and Elke Teich},
url = {https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2017/general/paper207.pdf},
year = {2017},
date = {2017},
booktitle = {Corpus Linguistics 2017},
address = {Birmingham, UK},
abstract = {We present an approach to investigate the differences between lexical words and function words and the respective parts-of-speech from an information-theoretical point of view (cf. Shannon, 1949). We use average surprisal (AvS) to measure the amount of information transmitted by a linguistic unit. We expect to find function words to be more predictable (having a lower AvS) and lexical words to be less predictable (having a higher AvS). We also assume that function words' AvS is fairly constant over time and registers, while AvS of lexical words is more variable depending on time and register.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Modeling intra-textual variation with entropy and surprisal: Topical vs. stylistic patterns Inproceedings

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, pp. 68-77, Vancouver, Canada, 2017.

We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).

@inproceedings{Degaetano-Ortlieb2017,
title = {Modeling intra-textual variation with entropy and surprisal: Topical vs. stylistic patterns},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W17-2209},
doi = {https://doi.org/10.18653/v1/W17-2209},
year = {2017},
date = {2017},
booktitle = {Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature},
pages = {68-77},
publisher = {Association for Computational Linguistics},
address = {Vancouver, Canada},
abstract = {We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kermes, Hannah; Degaetano-Ortlieb, Stefania; Knappen, Jörg; Khamis, Ashraf; Teich, Elke

The Royal Society Corpus: From Uncharted Data to Corpus Inproceedings

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), pp. 1928-1931, Portorož, Slovenia, 2016.

We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665-1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.

@inproceedings{Kermes2016,
title = {The Royal Society Corpus: From Uncharted Data to Corpus},
author = {Hannah Kermes and Stefania Degaetano-Ortlieb and J{\"o}rg Knappen and Ashraf Khamis and Elke Teich},
url = {https://aclanthology.org/L16-1305},
year = {2016},
date = {2016},
booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)},
pages = {1928-1931},
publisher = {European Language Resources Association (ELRA)},
address = {Portoro{\v{z}}, Slovenia},
abstract = {We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665-1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Fankhauser, Peter; Knappen, Jörg; Teich, Elke

Topical Diversification over Time in the Royal Society Corpus Inproceedings

Proceedings of Digital Humanities (DH'16), Krakow, Poland, 2016.

Science gradually developed into an established sociocultural domain starting from the mid-17th century onwards. In this process it became increasingly specialized and diversified. Here, we investigate a particular aspect of specialization on the basis of probabilistic topic models. As a corpus we use the Royal Society Corpus (Khamis et al. 2015), which covers the period from 1665 to 1869 and contains 9015 documents.
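
One simple way to quantify such diversification is the entropy of the corpus-level topic distribution per period: the more evenly probability mass is spread over topics, the higher the entropy. The sketch below uses invented topic proportions purely for illustration, not figures from the study:

```python
import math

# Hypothetical topic proportions aggregated per period (values are made up).
topic_shares = {
    "1665-1699": [0.55, 0.25, 0.15, 0.05],
    "1750-1799": [0.40, 0.30, 0.20, 0.10],
    "1820-1869": [0.30, 0.28, 0.22, 0.20],
}

def entropy(dist):
    """Shannon entropy in bits; higher values mean topics are more evenly spread."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

for period, dist in topic_shares.items():
    print(period, round(entropy(dist), 2))
```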

@inproceedings{Fankhauser2016,
title = {Topical Diversification over Time in the Royal Society Corpus},
author = {Peter Fankhauser and J{\"o}rg Knappen and Elke Teich},
url = {https://www.semanticscholar.org/paper/Topical-Diversification-Over-Time-In-The-Royal-Fankhauser-Knappen/7f7dce0d0b8209d0c841c8da031614fccb97a787},
year = {2016},
date = {2016},
booktitle = {Proceedings of Digital Humanities (DH'16)},
address = {Krakow, Poland},
abstract = {Science gradually developed into an established sociocultural domain starting from the mid-17th century onwards. In this process it became increasingly specialized and diversified. Here, we investigate a particular aspect of specialization on the basis of probabilistic topic models. As a corpus we use the Royal Society Corpus (Khamis et al. 2015), which covers the period from 1665 to 1869 and contains 9015 documents.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kermes, Hannah; Knappen, Jörg; Khamis, Ashraf; Degaetano-Ortlieb, Stefania; Teich, Elke

The Royal Society Corpus. Towards a high-quality resource for studying diachronic variation in scientific writing Inproceedings

Proceedings of Digital Humanities (DH'16), Krakow, Poland, 2016.
We introduce a diachronic corpus of English scientific writing – the Royal Society Corpus (RSC) – adopting a middle ground between big and ‘poor’ and small and ‘rich’ data. The corpus has been built from an electronic version of the Transactions and Proceedings of the Royal Society of London and comprises c. 35 million tokens from the period 1665-1869 (see Table 1). The motivation for building a corpus from this material is to investigate the diachronic development of written scientific English.

@inproceedings{Kermes2016a,
title = {The Royal Society Corpus. Towards a high-quality resource for studying diachronic variation in scientific writing},
author = {Hannah Kermes and J{\"o}rg Knappen and Ashraf Khamis and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://www.researchgate.net/publication/331648262_The_Royal_Society_Corpus_Towards_a_high-quality_corpus_for_studying_diachronic_variation_in_scientific_writing},
year = {2016},
date = {2016},
booktitle = {Proceedings of Digital Humanities (DH'16)},
address = {Krakow, Poland},
abstract = {We introduce a diachronic corpus of English scientific writing - the Royal Society Corpus (RSC) - adopting a middle ground between big and ‘poor’ and small and ‘rich’ data. The corpus has been built from an electronic version of the Transactions and Proceedings of the Royal Society of London and comprises c. 35 million tokens from the period 1665-1869 (see Table 1). The motivation for building a corpus from this material is to investigate the diachronic development of written scientific English.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Information-based modeling of diachronic linguistic change: from typicality to productivity Inproceedings

Proceedings of Language Technologies for the Socio-Economic Sciences and Humanities (LATECH'16), Association for Computational Linguistics, pp. 165-173, Berlin, Germany, 2016.

We present a new approach for modeling diachronic linguistic change in grammatical usage. We illustrate the approach on English scientific writing in Late Modern English, focusing on grammatical patterns that are potentially indicative of shifts in register, genre and/or style. Commonly, diachronic change is characterized by the relative frequency of typical linguistic features over time. However, to fully capture changing linguistic usage, feature productivity needs to be taken into account as well. We introduce a data-driven approach for systematically detecting typical features and assessing their productivity over time, using information-theoretic measures of entropy and surprisal.

@inproceedings{Degaetano-Ortlieb2016a,
title = {Information-based modeling of diachronic linguistic change: from typicality to productivity},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W16-2121},
doi = {https://doi.org/10.18653/v1/W16-2121},
year = {2016},
date = {2016},
booktitle = {Proceedings of Language Technologies for the Socio-Economic Sciences and Humanities (LATECH'16)},
pages = {165-173},
publisher = {Association for Computational Linguistics},
address = {Berlin, Germany},
abstract = {We present a new approach for modeling diachronic linguistic change in grammatical usage. We illustrate the approach on English scientific writing in Late Modern English, focusing on grammatical patterns that are potentially indicative of shifts in register, genre and/or style. Commonly, diachronic change is characterized by the relative frequency of typical linguistic features over time. However, to fully capture changing linguistic usage, feature productivity needs to be taken into account as well. We introduce a data-driven approach for systematically detecting typical features and assessing their productivity over time, using information-theoretic measures of entropy and surprisal.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Ordan, Noam; Teich, Elke

The taming of the data: Using text mining in building a corpus for diachronic analysis Inproceedings

Varieng - From Data to Evidence (d2e), University of Helsinki, 2015.

Social and historical linguistic studies benefit from corpora encoding contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora encoding variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).

@inproceedings{Degaetano-etal2015,
title = {The taming of the data: Using text mining in building a corpus for diachronic analysis},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and Noam Ordan and Elke Teich},
url = {https://www.ashrafkhamis.com/d2e2015.pdf},
year = {2015},
date = {2015-10-01},
booktitle = {Varieng - From Data to Evidence (d2e)},
address = {University of Helsinki},
abstract = {Social and historical linguistic studies benefit from corpora encoding contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora encoding variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Khamis, Ashraf; Degaetano-Ortlieb, Stefania; Kermes, Hannah; Knappen, Jörg; Ordan, Noam; Teich, Elke

A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus Inproceedings

Corpus Linguistics 2015, Lancaster, 2015.
There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.

@inproceedings{Khamis-etal2015,
title = {A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus},
author = {Ashraf Khamis and Stefania Degaetano-Ortlieb and Hannah Kermes and J{\"o}rg Knappen and Noam Ordan and Elke Teich},
url = {https://www.researchgate.net/publication/331648570_A_resource_for_the_diachronic_study_of_scientific_English_Introducing_the_Royal_Society_Corpus},
year = {2015},
date = {2015-07-01},
booktitle = {Corpus Linguistics 2015},
address = {Lancaster},
abstract = {There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Knappen, Jörg; Teich, Elke

Information Density in Scientific Writing: A Diachronic Perspective Inproceedings

"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015), RWTH Aachen University, 2015.
We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).

@inproceedings{Degaetano-etal2015b,
title = {Information Density in Scientific Writing: A Diachronic Perspective},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and J{\"o}rg Knappen and Elke Teich},
url = {https://www.researchgate.net/publication/331648534_Information_Density_in_Scientific_Writing_A_Diachronic_Perspective},
year = {2015},
date = {2015-07-01},
booktitle = {"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015)},
address = {RWTH Aachen University},
abstract = {We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Crocker, Matthew W.; Demberg, Vera; Teich, Elke

Information Density and Linguistic Encoding (IDeaL) Journal Article

KI - Künstliche Intelligenz, 30, pp. 77-81, 2015.

We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.

@article{crocker:demberg:teich,
title = {Information Density and Linguistic Encoding (IDeaL)},
author = {Matthew W. Crocker and Vera Demberg and Elke Teich},
url = {http://link.springer.com/article/10.1007/s13218-015-0391-y/fulltext.html},
doi = {https://doi.org/10.1007/s13218-015-0391-y},
year = {2015},
date = {2015},
journal = {KI - K{\"u}nstliche Intelligenz},
pages = {77-81},
volume = {30},
number = {1},
abstract = {We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.},
pubstate = {published},
type = {article}
}

Projects:   A1 A3 B1
