Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Ordan, Noam; Teich, Elke

The taming of the data: Using text mining in building a corpus for diachronic analysis

Varieng - From Data to Evidence (d2e), University of Helsinki, 2015.

Social and historical linguistic studies benefit from corpora that encode contextual metadata (e.g. time, register, genre) and relevant structural information (e.g. document structure). While small, handcrafted corpora offer control over selected contextual variables (e.g. the Brown/LOB corpora, which encode variety, register, and time) and are readily usable for analysis, big data (e.g. Google or Microsoft n-grams) are typically poorly contextualized and considered of limited value for linguistic analysis (see, however, Lieberman et al. 2007). Similarly, when we compile new corpora, the sources may not contain all relevant metadata and structural data (e.g. the Old Bailey sources vs. the richly annotated corpus in Huber 2007).