Publications

Degaetano-Ortlieb, Stefania

Stylistic Variation over 200 Years of Court Proceedings According to Gender and Social Class Inproceedings

Proceedings of the 2nd Workshop on Stylistic Variation, co-located with NAACL-HLT 2018, Association for Computational Linguistics, pp. 1-10, New Orleans, 2018.

We present an approach to detect stylistic variation across social variables (here: gender and social class), considering also diachronic change in language use. For detection of stylistic variation, we use relative entropy, measuring the difference between probability distributions at different linguistic levels (here: lexis and grammar). In addition, by relative entropy, we can determine which linguistic units are related to stylistic variation.
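
For illustration, here is a minimal sketch of the kind of relative-entropy (Kullback-Leibler divergence) comparison described in the abstract, computed over word frequencies of two toy samples; the token lists, the smoothing constant and the choice of words as units are placeholder assumptions, not the paper's actual setup:

```python
from collections import Counter
import math

def kld_per_word(tokens_a, tokens_b, alpha=0.01):
    """Relative entropy D(A||B) over word distributions, with additive smoothing.
    Returns the total KLD (in bits) and each word's contribution, so the units
    most strongly associated with variety A can be ranked."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values()) + alpha * len(vocab)
    n_b = sum(counts_b.values()) + alpha * len(vocab)
    contributions = {}
    for w in vocab:
        p = (counts_a[w] + alpha) / n_a
        q = (counts_b[w] + alpha) / n_b
        contributions[w] = p * math.log2(p / q)
    return sum(contributions.values()), contributions

# Toy samples standing in for, e.g., two social groups' speech.
sample_a = "my lord the prisoner took my purse and ran".split()
sample_b = "the prisoner was seen near the house by the constable".split()
total, contrib = kld_per_word(sample_a, sample_b)
print(round(total, 3))
print(sorted(contrib, key=contrib.get, reverse=True)[:5])  # most distinctive words
```

Ranking units by their contribution to the divergence is what makes it possible to say which words or grammatical patterns carry the stylistic difference.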

@inproceedings{Degaetano-Ortlieb2018,
title = {Stylistic Variation over 200 Years of Court Proceedings According to Gender and Social Class},
author = {Stefania Degaetano-Ortlieb},
url = {https://aclanthology.org/W18-1601},
doi = {https://doi.org/10.18653/v1/W18-1601},
year = {2018},
date = {2018},
booktitle = {Proceedings of the 2nd Workshop on Stylistic Variation, co-located with NAACL-HLT 2018},
pages = {1-10},
publisher = {Association for Computational Linguistics},
address = {New Orleans},
abstract = {We present an approach to detect stylistic variation across social variables (here: gender and social class), considering also diachronic change in language use. For detection of stylistic variation, we use relative entropy, measuring the difference between probability distributions at different linguistic levels (here: lexis and grammar). In addition, by relative entropy, we can determine which linguistic units are related to stylistic variation.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Fischer, Stefan; Knappen, Jörg; Teich, Elke

Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English Inproceedings

Proceedings of DH 2018, Mexico City, Mexico, 2018.

In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence for their use on factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach for exploring the research fields of selected authors in a corpus of late modern scientific English by topic modelling, looking at the topics assigned to an author’s texts over the author’s lifetime. Areas of applications we target are history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), or diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016). We use the Royal Society Corpus (RSC, Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available at https://fedora.clarin-d.uni-saarland.de/rsc/. As we are interested in the development of individual authors, we focus on the single-author texts (81%) of the corpus. In total, 2,752 names are annotated in the single-author papers, but the activity of authors varies. Figure 1 shows that a small group of authors wrote a large portion of the texts. In fact, the twelve authors used for our analysis wrote 11% of the single-author articles.
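
As a rough illustration of the workflow sketched in the abstract (fit a topic model over an author's dated texts, then read off topic proportions per year), here is a minimal sketch using scikit-learn; the documents, topic count and preprocessing are invented placeholders, not the RSC pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder documents by one author, keyed by publication year.
docs = {
    1825: "electricity of the torpedo and the conducting power of metals",
    1833: "magnetic rotation and the electric current in voltaic apparatus",
    1845: "action of magnets upon light and the magnetisation of transparent bodies",
}

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs.values())

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # rows: documents, columns: topic proportions

# Topic proportions over the author's lifetime.
for year, dist in zip(docs, doc_topics):
    print(year, [round(p, 2) for p in dist])

# Top words per topic, to help label the topics.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
```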

@inproceedings{fischer-etal2018,
title = {Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English},
author = {Stefan Fischer and J{\"o}rg Knappen and Elke Teich},
url = {https://dh2018.adho.org/en/using-topic-modelling-to-explore-authors-research-fields-in-a-corpus-of-historical-scientific-english/},
year = {2018},
date = {2018},
booktitle = {Proceedings of DH 2018},
address = {Mexico City, Mexico},
abstract = {In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence for their use on factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach for exploring the research fields of selected authors in a corpus of late modern scientific English by topic modelling, looking at the topics assigned to an author’s texts over the author’s lifetime. Areas of applications we target are history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), or diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016). We use the Royal Society Corpus (RSC, Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available at https://fedora.clarin-d.uni-saarland.de/rsc/. As we are interested in the development of individual authors, we focus on the single-author texts (81%) of the corpus. In total, 2,752 names are annotated in the single-author papers, but the activity of authors varies. Figure 1 shows that a small group of authors wrote a large portion of the texts. In fact, the twelve authors used for our analysis wrote 11% of the single-author articles.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Menzel, Katrin

Using diachronic corpora of scientific journal articles for complementing English corpus-based dictionaries and lexicographical resources for specialized languages Inproceedings

Proceedings of EURALEX2018, Ljubljana University Press, Faculty of Arts, Ljubljana, Slovenia, 2018, ISBN 978-961-06-0097-8.

As technology and science permeate nearly all areas of life in modern times, there is a certain trend for standard dictionaries to bolster their technical and scientific vocabulary and to identify more components, for instance more combining forms, in technical terms and terminological phrases. In this paper it is argued that recently built diachronic corpora of scientific journal articles with robust linguistic and metadata-based features are important resources for complementing English corpus-based dictionaries and lexicographical resources for specialized languages. The Royal Society Corpus (RSC, ca. 9,800 digitized texts, 32 million tokens) in combination with the Scientific Text Corpus (SciTex, ca. 5,000 documents, 39 million tokens), as two recently created corpus resources, offer the possibility to provide a fuller picture of the development of specialized vocabulary and of the number of meanings that general and technical terms have accumulated during their history. They facilitate the systematic identification of lexemes with specific linguistic characteristics or from selected disciplines and fields, and allow us to gain a better understanding of the development of academic writing in English scientific periodicals across several centuries, from their beginnings to the present day.

@inproceedings{Menzel2017b,
title = {Using diachronic corpora of scientific journal articles for complementing English corpus-based dictionaries and lexicographical resources for specialized languages},
author = {Katrin Menzel},
url = {https://euralex.org/publications/using-diachronic-corpora-of-scientific-journal-articles-for-complementing-english-corpus-based-dictionaries-and-lexicographical-resources-for-specialized-languages/},
year = {2018},
date = {2018},
booktitle = {Proceedings of EURALEX2018},
isbn = {978-961-06-0097-8},
publisher = {Ljubljana University Press, Faculty of Arts},
address = {Ljubljana, Slovenia},
abstract = {As technology and science permeate nearly all areas of life in modern times, there is a certain trend for standard dictionaries to bolster their technical and scientific vocabulary and to identify more components, for instance more combining forms, in technical terms and terminological phrases. In this paper it is argued that recently built diachronic corpora of scientific journal articles with robust linguistic and metadata-based features are important resources for complementing English corpus-based dictionaries and lexicographical resources for specialized languages. The Royal Society Corpus (RSC, ca. 9,800 digitized texts, 32 million tokens) in combination with the Scientific Text Corpus (SciTex, ca. 5,000 documents, 39 million tokens), as two recently created corpus resources, offer the possibility to provide a fuller picture of the development of specialized vocabulary and of the number of meanings that general and technical terms have accumulated during their history. They facilitate the systematic identification of lexemes with specific linguistic characteristics or from selected disciplines and fields, and allow us to gain a better understanding of the development of academic writing in English scientific periodicals across several centuries, from their beginnings to the present day.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania

Variation in language use across social variables: a data-driven approach Inproceedings

Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER), Bari, Italy, 2017.

We present a data-driven approach to study language use over time according to social variables (henceforth SV), considering also interactions between different variables. Besides sociolinguistic studies on language variation according to SVs (e.g., Weinreich et al. 1968, Bernstein 1971, Eckert 1989, Milroy and Milroy 1985), recently computational approaches have gained prominence (see e.g., Eisenstein 2015, Danescu-Niculescu-Mizil et al. 2013, and Nguyen et al. 2017 for an overview), not least due to an increase in data availability based on social media and an increasing awareness of the importance of linguistic variation according to SVs in the NLP community.

@inproceedings{Degaetano-Ortlieb2017b,
title = {Variation in language use across social variables: a data-driven approach},
author = {Stefania Degaetano-Ortlieb},
url = {https://stefaniadegaetano.files.wordpress.com/2017/07/clavier2017_slingpro_accepted.pdf},
year = {2017},
date = {2017},
booktitle = {Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER)},
address = {Bari, Italy},
abstract = {We present a data-driven approach to study language use over time according to social variables (henceforth SV), considering also interactions between different variables. Besides sociolinguistic studies on language variation according to SVs (e.g., Weinreich et al. 1968, Bernstein 1971, Eckert 1989, Milroy and Milroy 1985), recently computational approaches have gained prominence (see e.g., Eisenstein 2015, Danescu-Niculescu-Mizil et al. 2013, and Nguyen et al. 2017 for an overview), not least due to an increase in data availability based on social media and an increasing awareness of the importance of linguistic variation according to SVs in the NLP community.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

The course of grammatical change in scientific writing: Interdependency between convention and productivity Inproceedings

Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER), Bari, Italy, 2017.

We present an empirical approach to analyze the course of usage change in scientific writing. A great amount of linguistic research has dealt with grammatical changes, showing their gradual course of change, which nearly always progresses stepwise (see e.g. Bybee et al. 1994, Hopper and Traugott 2003, Lee 2011, De Smet and Van de Velde 2013). Less well understood is under which conditions these changes occur. According to De Smet (2016), specific expressions increase in frequency in one grammatical context, adopting a more conventionalized use, which in turn makes them available in closely related grammatical contexts.

@inproceedings{Degaetano-Ortlieb2017c,
title = {The course of grammatical change in scientific writing: Interdependency between convention and productivity},
author = {Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2017/07/clavier2017-degaetano-etal_accepted_final.pdf},
year = {2017},
date = {2017},
booktitle = {Proceedings of the Corpus and Language Variation in English Research Conference (CLAVIER)},
address = {Bari, Italy},
abstract = {We present an empirical approach to analyze the course of usage change in scientific writing. A great amount of linguistic research has dealt with grammatical changes, showing their gradual course of change, which nearly always progresses stepwise (see e.g. Bybee et al. 1994, Hopper and Traugott 2003, Lee 2011, De Smet and Van de Velde 2013). Less well understood is under which conditions these changes occur. According to De Smet (2016), specific expressions increase in frequency in one grammatical context, adopting a more conventionalized use, which in turn makes them available in closely related grammatical contexts.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Menzel, Katrin; Degaetano-Ortlieb, Stefania

The diachronic development of combining forms in scientific writing Journal Article

Lege Artis. Language yesterday, today, tomorrow. The Journal of University of SS Cyril and Methodius in Trnava. Warsaw: De Gruyter Open, 2, pp. 185-249, 2017.
This paper addresses the diachronic development of combining forms in English scientific texts over approximately 350 years, from the early stages of the first scholarly journals that were published in English to contemporary English scientific publications. In this paper a critical discussion of the category of combining forms is presented and a case study is produced to examine the role of selected combining forms in two diachronic English corpora.

@article{Menzel2017,
title = {The diachronic development of combining forms in scientific writing},
author = {Katrin Menzel and Stefania Degaetano-Ortlieb},
url = {https://www.researchgate.net/publication/321776056_The_diachronic_development_of_combining_forms_in_scientific_writing},
year = {2017},
date = {2017},
journal = {Lege Artis. Language yesterday, today, tomorrow. The Journal of University of SS Cyril and Methodius in Trnava. Warsaw: De Gruyter Open},
pages = {185-249},
volume = {2},
number = {2},
abstract = {This paper addresses the diachronic development of combining forms in English scientific texts over approximately 350 years, from the early stages of the first scholarly journals that were published in English to contemporary English scientific publications. In this paper a critical discussion of the category of combining forms is presented and a case study is produced to examine the role of selected combining forms in two diachronic English corpora.},
pubstate = {published},
type = {article}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Fischer, Stefan; Demberg, Vera; Teich, Elke

An information-theoretic account on the diachronic development of discourse connectors in scientific writing Inproceedings

39th DGfS AG1, Saarbrücken, Germany, 2017.

@inproceedings{Degaetano-Ortlieb2017d,
title = {An information-theoretic account on the diachronic development of discourse connectors in scientific writing},
author = {Stefania Degaetano-Ortlieb and Stefan Fischer and Vera Demberg and Elke Teich},
year = {2017},
date = {2017},
booktitle = {39th DGfS AG1},
address = {Saarbr{\"u}cken, Germany},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Knappen, Jörg; Fischer, Stefan; Kermes, Hannah; Teich, Elke; Fankhauser, Peter

The making of the Royal Society Corpus Inproceedings

21st Nordic Conference on Computational Linguistics (NoDaLiDa), Workshop on Processing Historical Language, pp. 7-11, Gothenburg, Sweden, 2017.
The Royal Society Corpus is a corpus of Early and Late modern English built in an agile process covering publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) with a size of approximately 30 million words. In this paper we will provide details on two aspects of the building process namely the mining of patterns for OCR correction and the improvement and evaluation of part-of-speech tagging.
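
The pattern-based OCR correction mentioned in the abstract can be pictured as a list of substitution rules applied to the raw text; the rules below are illustrative long-s confusions ("fome" for "some") written by hand, not the patterns actually mined for the corpus:

```python
import re

# Illustrative OCR substitution patterns of the long-s type;
# in the paper such patterns are mined from the corpus rather than hand-written.
PATTERNS = [
    (re.compile(r"\bfome\b"), "some"),
    (re.compile(r"\bfuch\b"), "such"),
    (re.compile(r"\bfaid\b"), "said"),
]

def correct(text: str) -> str:
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(correct("the faid experiment was made with fuch care"))
# -> "the said experiment was made with such care"
```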

@inproceedings{Knappen2017,
title = {The making of the Royal Society Corpus},
author = {J{\"o}rg Knappen and Stefan Fischer and Hannah Kermes and Elke Teich and Peter Fankhauser},
url = {https://www.researchgate.net/publication/331648134_The_Making_of_the_Royal_Society_Corpus},
year = {2017},
date = {2017},
booktitle = {21st Nordic Conference on Computational Linguistics (NoDaLiDa) Workshop on Processing Historical Language},
pages = {7-11},
publisher = {Workshop on Processing Historical Language},
address = {Gothenburg, Sweden},
abstract = {The Royal Society Corpus is a corpus of Early and Late modern English built in an agile process covering publications of the Royal Society of London from 1665 to 1869 (Kermes et al., 2016) with a size of approximately 30 million words. In this paper we will provide details on two aspects of the building process namely the mining of patterns for OCR correction and the improvement and evaluation of part-of-speech tagging.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kermes, Hannah; Teich, Elke

Average surprisal of parts-of-speech Inproceedings

Corpus Linguistics 2017, Birmingham, UK, 2017.

We present an approach to investigate the differences between lexical words and function words and the respective parts-of-speech from an information-theoretical point of view (cf. Shannon, 1949). We use average surprisal (AvS) to measure the amount of information transmitted by a linguistic unit. We expect to find function words to be more predictable (having a lower AvS) and lexical words to be less predictable (having a higher AvS). We also assume that function words’ AvS is fairly constant over time and registers, while AvS of lexical words is more variable depending on time and register.
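
A minimal sketch of how average surprisal per part of speech can be computed: estimate a language model, take the surprisal of each token given its context, and average by POS class. The toy tagged sequence and the add-one-smoothed bigram model are assumptions for illustration, not the authors' language model:

```python
from collections import Counter, defaultdict
import math

# Toy POS-tagged sequence of (word, tag) pairs; a real study uses a tagged corpus.
tagged = [("the", "DET"), ("acid", "NOUN"), ("dissolves", "VERB"),
          ("the", "DET"), ("metal", "NOUN"), ("in", "ADP"),
          ("the", "DET"), ("solution", "NOUN")]

words = [w for w, _ in tagged]
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
vocab_size = len(unigrams)

def surprisal(prev, word):
    """-log2 P(word | prev) under an add-one-smoothed bigram model."""
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

# Average surprisal per part of speech (skipping the sequence-initial token).
by_tag = defaultdict(list)
for (prev, _), (word, tag) in zip(tagged, tagged[1:]):
    by_tag[tag].append(surprisal(prev, word))

for tag, values in by_tag.items():
    print(tag, round(sum(values) / len(values), 2))
```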

@inproceedings{Kermes2017,
title = {Average surprisal of parts-of-speech},
author = {Hannah Kermes and Elke Teich},
url = {https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2017/general/paper207.pdf},
year = {2017},
date = {2017},
booktitle = {Corpus Linguistics 2017},
address = {Birmingham, UK},
abstract = {We present an approach to investigate the differences between lexical words and function words and the respective parts-of-speech from an information-theoretical point of view (cf. Shannon, 1949). We use average surprisal (AvS) to measure the amount of information transmitted by a linguistic unit. We expect to find function words to be more predictable (having a lower AvS) and lexical words to be less predictable (having a higher AvS). We also assume that function words' AvS is fairly constant over time and registers, while AvS of lexical words is more variable depending on time and register.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Modeling intra-textual variation with entropy and surprisal: Topical vs. stylistic patterns Inproceedings

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, pp. 68-77, Vancouver, Canada, 2017.

We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).

@inproceedings{Degaetano-Ortlieb2017,
title = {Modeling intra-textual variation with entropy and surprisal: Topical vs. stylistic patterns},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W17-2209},
doi = {https://doi.org/10.18653/v1/W17-2209},
year = {2017},
date = {2017},
booktitle = {Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature},
pages = {68-77},
publisher = {Association for Computational Linguistics},
address = {Vancouver, Canada},
abstract = {We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kermes, Hannah; Degaetano-Ortlieb, Stefania; Knappen, Jörg; Khamis, Ashraf; Teich, Elke

The Royal Society Corpus: From Uncharted Data to Corpus Inproceedings

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), pp. 1928-1931, Portorož, Slovenia, 2016.

We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665-1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.

@inproceedings{Kermes2016,
title = {The Royal Society Corpus: From Uncharted Data to Corpus},
author = {Hannah Kermes and Stefania Degaetano-Ortlieb and J{\"o}rg Knappen and Ashraf Khamis and Elke Teich},
url = {https://aclanthology.org/L16-1305},
year = {2016},
date = {2016},
booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)},
pages = {1928-1931},
publisher = {European Language Resources Association (ELRA)},
address = {Portoro{\v{z}}, Slovenia},
abstract = {We present the Royal Society Corpus (RSC) built from the Philosophical Transactions and Proceedings of the Royal Society of London. At present, the corpus contains articles from the first two centuries of the journal (1665-1869) and amounts to around 35 million tokens. The motivation for building the RSC is to investigate the diachronic linguistic development of scientific English. Specifically, we assume that due to specialization, linguistic encodings become more compact over time (Halliday, 1988; Halliday and Martin, 1993), thus creating a specific discourse type characterized by high information density that is functional for expert communication. When building corpora from uncharted material, typically not all relevant meta-data (e.g. author, time, genre) or linguistic data (e.g. sentence/word boundaries, words, parts of speech) is readily available. We present an approach to obtain good quality meta-data and base text data adopting the concept of Agile Software Development.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Fankhauser, Peter; Knappen, Jörg; Teich, Elke

Topical Diversification over Time in the Royal Society Corpus Inproceedings

Proceedings of Digital Humanities (DH'16), Krakow, Poland, 2016.

Science gradually developed into an established sociocultural domain starting from the mid-17th century onwards. In this process it became increasingly specialized and diversified. Here, we investigate a particular aspect of specialization on the basis of probabilistic topic models. As a corpus we use the Royal Society Corpus (Khamis et al. 2015), which covers the period from 1665 to 1869 and contains 9015 documents.
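
One simple way to quantify such diversification is the entropy of the corpus-level topic distribution per period: the more evenly probability mass is spread over topics, the higher the entropy. The sketch below uses invented topic proportions purely for illustration, not figures from the study:

```python
import math

# Hypothetical topic proportions aggregated per period (values are made up).
topic_shares = {
    "1665-1699": [0.55, 0.25, 0.15, 0.05],
    "1750-1799": [0.40, 0.30, 0.20, 0.10],
    "1820-1869": [0.30, 0.28, 0.22, 0.20],
}

def entropy(dist):
    """Shannon entropy in bits; higher values mean topics are more evenly spread."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

for period, dist in topic_shares.items():
    print(period, round(entropy(dist), 2))
```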

@inproceedings{Fankhauser2016,
title = {Topical Diversification over Time in the Royal Society Corpus},
author = {Peter Fankhauser and J{\"o}rg Knappen and Elke Teich},
url = {https://www.semanticscholar.org/paper/Topical-Diversification-Over-Time-In-The-Royal-Fankhauser-Knappen/7f7dce0d0b8209d0c841c8da031614fccb97a787},
year = {2016},
date = {2016},
booktitle = {Proceedings of Digital Humanities (DH'16)},
address = {Krakow, Poland},
abstract = {Science gradually developed into an established sociocultural domain starting from the mid-17th century onwards. In this process it became increasingly specialized and diversified. Here, we investigate a particular aspect of specialization on the basis of probabilistic topic models. As a corpus we use the Royal Society Corpus (Khamis et al. 2015), which covers the period from 1665 to 1869 and contains 9015 documents.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Kermes, Hannah; Knappen, Jörg; Khamis, Ashraf; Degaetano-Ortlieb, Stefania; Teich, Elke

The Royal Society Corpus. Towards a high-quality resource for studying diachronic variation in scientific writing Inproceedings

Proceedings of Digital Humanities (DH'16), Krakow, Poland, 2016.
We introduce a diachronic corpus of English scientific writing – the Royal Society Corpus (RSC) – adopting a middle ground between big and ‘poor’ and small and ‘rich’ data. The corpus has been built from an electronic version of the Transactions and Proceedings of the Royal Society of London and comprises c. 35 million tokens from the period 1665-1869 (see Table 1). The motivation for building a corpus from this material is to investigate the diachronic development of written scientific English.

@inproceedings{Kermes2016a,
title = {The Royal Society Corpus. Towards a high-quality resource for studying diachronic variation in scientific writing},
author = {Hannah Kermes and J{\"o}rg Knappen and Ashraf Khamis and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://www.researchgate.net/publication/331648262_The_Royal_Society_Corpus_Towards_a_high-quality_corpus_for_studying_diachronic_variation_in_scientific_writing},
year = {2016},
date = {2016},
booktitle = {Proceedings of Digital Humanities (DH'16)},
address = {Krakow, Poland},
abstract = {We introduce a diachronic corpus of English scientific writing - the Royal Society Corpus (RSC) - adopting a middle ground between big and ‘poor’ and small and ‘rich’ data. The corpus has been built from an electronic version of the Transactions and Proceedings of the Royal Society of London and comprises c. 35 million tokens from the period 1665-1869 (see Table 1). The motivation for building a corpus from this material is to investigate the diachronic development of written scientific English.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Teich, Elke

Information-based modeling of diachronic linguistic change: from typicality to productivity Inproceedings

Proceedings of Language Technologies for the Socio-Economic Sciences and Humanities (LATECH'16), Association for Computational Linguistics, pp. 165-173, Berlin, Germany, 2016.

We present a new approach for modeling diachronic linguistic change in grammatical usage. We illustrate the approach on English scientific writing in Late Modern English, focusing on grammatical patterns that are potentially indicative of shifts in register, genre and/or style. Commonly, diachronic change is characterized by the relative frequency of typical linguistic features over time. However, to fully capture changing linguistic usage, feature productivity needs to be taken into account as well. We introduce a data-driven approach for systematically detecting typical features and assessing their productivity over time, using information-theoretic measures of entropy and surprisal.

@inproceedings{Degaetano-Ortlieb2016a,
title = {Information-based modeling of diachronic linguistic change: from typicality to productivity},
author = {Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://aclanthology.org/W16-2121},
doi = {https://doi.org/10.18653/v1/W16-2121},
year = {2016},
date = {2016},
booktitle = {Proceedings of Language Technologies for the Socio-Economic Sciences and Humanities (LATECH'16)},
pages = {165-173},
publisher = {Association for Computational Linguistics},
address = {Berlin, Germany},
abstract = {We present a new approach for modeling diachronic linguistic change in grammatical usage. We illustrate the approach on English scientific writing in Late Modern English, focusing on grammatical patterns that are potentially indicative of shifts in register, genre and/or style. Commonly, diachronic change is characterized by the relative frequency of typical linguistic features over time. However, to fully capture changing linguistic usage, feature productivity needs to be taken into account as well. We introduce a data-driven approach for systematically detecting typical features and assessing their productivity over time, using information-theoretic measures of entropy and surprisal.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Ordan, Noam; Teich, Elke

The taming of the data: Using text mining in building a corpus for diachronic analysis Inproceedings

Varieng - From Data to Evidence (d2e), University of Helsinki, 2015.

Social and historical linguistic studies benefit from corpora encoding contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora encoding variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).

@inproceedings{Degaetano-etal2015,
title = {The taming of the data: Using text mining in building a corpus for diachronic analysis},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and Noam Ordan and Elke Teich},
url = {https://www.ashrafkhamis.com/d2e2015.pdf},
year = {2015},
date = {2015-10-01},
booktitle = {Varieng - From Data to Evidence (d2e)},
address = {University of Helsinki},
abstract = {Social and historical linguistic studies benefit from corpora encoding contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora encoding variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Khamis, Ashraf; Degaetano-Ortlieb, Stefania; Kermes, Hannah; Knappen, Jörg; Ordan, Noam; Teich, Elke

A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus Inproceedings

Corpus Linguistics 2015, Lancaster, 2015.
There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.

@inproceedings{Khamis-etal2015,
title = {A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus},
author = {Ashraf Khamis and Stefania Degaetano-Ortlieb and Hannah Kermes and J{\"o}rg Knappen and Noam Ordan and Elke Teich},
url = {https://www.researchgate.net/publication/331648570_A_resource_for_the_diachronic_study_of_scientific_English_Introducing_the_Royal_Society_Corpus},
year = {2015},
date = {2015-07-01},
booktitle = {Corpus Linguistics 2015},
address = {Lancaster},
abstract = {There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Knappen, Jörg; Teich, Elke

Information Density in Scientific Writing: A Diachronic Perspective Inproceedings

"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015), RWTH Aachen University, 2015.
We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).

@inproceedings{Degaetano-etal2015b,
title = {Information Density in Scientific Writing: A Diachronic Perspective},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and J{\"o}rg Knappen and Elke Teich},
url = {https://www.researchgate.net/publication/331648534_Information_Density_in_Scientific_Writing_A_Diachronic_Perspective},
year = {2015},
date = {2015-07-01},
booktitle = {"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015)},
address = {RWTH Aachen University},
abstract = {We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Crocker, Matthew W.; Demberg, Vera; Teich, Elke

Information Density and Linguistic Encoding (IDeaL) Journal Article

KI - Künstliche Intelligenz, 30, pp. 77-81, 2015.

We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.

@article{crocker:demberg:teich,
title = {Information Density and Linguistic Encoding (IDeaL)},
author = {Matthew W. Crocker and Vera Demberg and Elke Teich},
url = {http://link.springer.com/article/10.1007/s13218-015-0391-y/fulltext.html},
doi = {https://doi.org/10.1007/s13218-015-0391-y},
year = {2015},
date = {2015},
journal = {KI - K{\"u}nstliche Intelligenz},
pages = {77-81},
volume = {30},
number = {1},
abstract = {We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.},
pubstate = {published},
type = {article}
}

Projects:   A1 A3 B1
