Information Density in English Scientific Writing: A Diachronic Perspective
Project B1
The overarching goal of B1 is to gain insights into the role of rational communicative concerns in diachronic language change. Specifically, we are interested in the emergence of sublanguages or registers, i.e. distinctive, fairly persistent functional varieties, focusing on scientific English and its development from the late modern period (1700–1900) up to recent times. We started from the overall hypothesis of communicative optimization, which states that scientific English developed an optimal code for expert-to-expert communication over time. Based on a comprehensive corpus compiled from the publications of the Royal Society of London, we applied selected types of computational language models (e.g. topic models, n-gram models, word embeddings) and combined them with information-based measures (e.g. entropy, surprisal) to capture diachronic variation. Across different models, we observe the same trend: entropy decreases overall, with temporary peaks of high entropy/surprisal (innovation), accompanied by a continuous re-assessment of existing linguistic options. This re-assessment is manifested by discarding options, shifting options to other contexts of use (diversification), or giving strong preference to one option over alternative ones (conventionalization). The choice-constraining effects associated with diversification and conventionalization point to a general diachronic mechanism for maintaining communicative function, which is a major novel insight arising from our studies.
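As a rough illustration of the two information-based measures named above, surprisal and entropy can be computed from a simple relative-frequency unigram model. This is only a minimal sketch under stated assumptions: the function names and the toy token lists are hypothetical, and the sketch does not reproduce the project's corpus, models, or results.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Estimate unigram probabilities by relative frequency (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def surprisal(word, model):
    """Surprisal of a word in bits: -log2 p(word)."""
    return -math.log2(model[word])


def entropy(model):
    """Shannon entropy in bits: the expected surprisal under the model."""
    return -sum(p * math.log2(p) for p in model.values())


# Toy token lists standing in for two time slices (illustrative only, not real data).
slice_uniform = ["a", "b", "c", "d"]   # four equally likely options
slice_skewed = ["a", "a", "a", "b"]    # one option strongly preferred

model_uniform = unigram_model(slice_uniform)
model_skewed = unigram_model(slice_skewed)

entropy_uniform = entropy(model_uniform)
entropy_skewed = entropy(model_skewed)
rare_word_surprisal = surprisal("b", model_skewed)
```

In this toy setting, the slice in which one option is strongly preferred has lower entropy than the slice with equally likely options, while the dispreferred option carries higher surprisal than the preferred one. Applying such measures to real diachronic data would additionally require context-sensitive models and smoothing for unseen items.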
In the next project phase we intend to address the following research questions. (RQ 1) Are the linguistic patterns characterizing the diachronic development of scientific language similar across registers/genres, or do they differ? Similarity would be evidence of a more general diachronic mechanism. (RQ 2) Within scientific language, what are additional, typical imprints of conventionalization? So far, we have focused on linguistic features of the field of discourse. In order to arrive at a fuller picture of register formation, we now shift our focus to linguistic units and items that encode tenor and mode of discourse, such as formulaic expressions of stance and markers of discourse relations. (RQ 3) How can we assess the overall communicative efficiency of scientific language? We have previously argued that scientific language developed an optimal code for expert communication, with a general diachronic preference for compact structures such as noun phrases. Surprisal alone cannot explain the advantages of this trend. Instead, we suspect that more compact structures have positive effects on (working) memory, for example through information locality. Therefore, we plan to investigate the interplay between memory and surprisal.
Focusing on selected linguistic phenomena (multi-word expressions, discourse markers, nominal vs. verbal phrases), we complement a corpus-based, production-oriented approach with selected behavioural, comprehension-oriented studies.
Keywords: diachronic linguistics, scientific discourse, register variation, relative information density