Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Knappen, Jörg; Teich, Elke

Information Density in Scientific Writing: A Diachronic Perspective

"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015), RWTH Aachen University, 2015.

We report on a project investigating the development of scientific writing in English from the mid-17th century to the present. While scientific discourse is a much-researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops into an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below).

We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged.

For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. We pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. In principle, all linguistic levels are relevant here (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding, etc.) and syntactic reduction (e.g. reduced vs. full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned on its context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a timeline.

In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).
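Approach (b) can be illustrated with a minimal sketch. The abstract does not specify which language model or comparison measure the project uses, so everything here is an assumption for illustration: an add-alpha smoothed bigram model stands in for "the probability of an item conditioned on its context" (reported as average surprisal in bits), and Kullback-Leibler divergence between smoothed unigram distributions stands in for the comparison of two time periods. The function names and the toy token lists are hypothetical and are not drawn from the corpus.

```python
import math
from collections import Counter


def avg_bigram_surprisal(tokens, train_tokens, alpha=1.0):
    """Average surprisal in bits per token of `tokens` under an
    add-alpha bigram model estimated from `train_tokens`.
    (Assumed model; the abstract does not name one.)"""
    vocab_size = len(set(train_tokens) | set(tokens))
    # Count each token as a context, i.e. every position except the last.
    context_counts = Counter(train_tokens[:-1])
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
    total_bits, n = 0.0, 0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, cur)] + alpha) / (
            context_counts[prev] + alpha * vocab_size
        )
        total_bits += -math.log2(p)  # surprisal of `cur` given `prev`
        n += 1
    return total_bits / n if n else 0.0


def kl_divergence(tokens_p, tokens_q, alpha=1.0):
    """D(P || Q) in bits between add-alpha smoothed unigram
    distributions of two token samples (e.g. two time periods)."""
    vocab = set(tokens_p) | set(tokens_q)
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    denom_p = len(tokens_p) + alpha * len(vocab)
    denom_q = len(tokens_q) + alpha * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (counts_p[word] + alpha) / denom_p
        q = (counts_q[word] + alpha) / denom_q
        divergence += p * math.log2(p / q)
    return divergence


# Made-up toy samples standing in for two periods of the corpus.
early = "the air is elastic the air is heavy".split()
late = "air pressure measurement shows air elasticity".split()

familiar = "the air is".split()      # word order attested in `early`
unfamiliar = "is air the".split()    # same words, unattested order

print(avg_bigram_surprisal(familiar, early))
print(avg_bigram_surprisal(unfamiliar, early))
print(kl_divergence(late, early))
```

Under these assumptions, a sequence that is predictable from the training period receives lower average surprisal than an unattested one, and the divergence grows as the word distributions of two periods drift apart. Smoothing is needed in both functions so that unseen items do not receive zero probability.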