Fischer, Stefan; Knappen, Jörg; Teich, Elke

Using Topic Modelling to Explore Authors’ Research Fields in a Corpus of Historical Scientific English

Proceedings of DH 2018, Mexico City, Mexico, 2018.

In the digital humanities, topic models are a widely applied text mining method (Meeks and Weingart, 2012). While their use for mining literary texts is not entirely straightforward (Schmidt, 2012), there is ample evidence of their usefulness for factual text (e.g. Au Yeung and Jatowt, 2011; Thompson et al., 2016). We present an approach that uses topic modelling to explore the research fields of selected authors in a corpus of late modern scientific English, looking at the topics assigned to an author’s texts over the author’s lifetime. The application areas we target are the history of science, where we may be interested in the evolution of scientific disciplines over time (Thompson et al., 2016; Fankhauser et al., 2016), and diachronic linguistics, where we may be interested in the formation of languages for specific purposes (LSP) or specific scientific “styles” (cf. Bazerman, 1988; Degaetano-Ortlieb and Teich, 2016).

We use the Royal Society Corpus (RSC; Kermes et al., 2016), which is based on the first two centuries (1665–1869) of the Philosophical Transactions and the Proceedings of the Royal Society of London. The corpus contains 9,779 texts (32 million tokens) and is available online. As we are interested in the development of individual authors, we focus on the single-author texts (81%) of the corpus. In total, 2,752 names are annotated in the single-author papers, but the authors’ activity varies: Figure 1 shows that a small group of authors wrote a large portion of the texts. In fact, the twelve authors used for our analysis wrote 11% of the single-author articles.
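The author-share figures above could be derived from per-text metadata roughly as follows. This is only a minimal sketch: the record layout (a dict with an `authors` list), the helper names `single_author_fraction` and `top_author_share`, and the toy data are all hypothetical and do not reflect the actual RSC metadata format.

```python
from collections import Counter

def single_author_fraction(texts):
    """Fraction of all texts that have exactly one annotated author."""
    if not texts:
        return 0.0
    return sum(1 for t in texts if len(t["authors"]) == 1) / len(texts)

def top_author_share(texts, k):
    """Fraction of single-author texts written by the k most prolific authors."""
    # Keep only the single-author texts, represented by their author name.
    single = [t["authors"][0] for t in texts if len(t["authors"]) == 1]
    if not single:
        return 0.0
    counts = Counter(single)
    # Sum the text counts of the k authors with the most texts.
    return sum(n for _, n in counts.most_common(k)) / len(single)

# Toy metadata, invented purely for illustration (not RSC data).
toy_texts = [
    {"authors": ["A"]},
    {"authors": ["A"]},
    {"authors": ["B"]},
    {"authors": ["C"]},
    {"authors": ["A", "B"]},
]

fraction = single_author_fraction(toy_texts)
share = top_author_share(toy_texts, k=1)
```

On real corpus metadata, calling `top_author_share` with `k=12` would correspond to the kind of figure reported for the twelve selected authors, provided those are the twelve most prolific authors, which the text does not state.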