Lemke, Tyll Robin; Schäfer, Christine; Reich, Ingo

Modeling the predictive potential of extralinguistic context with script knowledge: The case of fragments

PLOS ONE, 16, pp. e0246255, 2021.

We describe a novel approach to estimating the predictability of utterances given extralinguistic context in psycholinguistic research. Predictability effects on language production and comprehension are widely attested, but so far predictability has mostly been manipulated through local linguistic context, which is captured with n-gram language models. However, this method does not allow to investigate predictability effects driven by extralinguistic context. Modeling effects of extralinguistic context is particularly relevant to discourse-initial expressions, which can be predictable even if they lack linguistic context at all. We propose to use script knowledge as an approximation to extralinguistic context. Since the application of script knowledge involves the generation of prediction about upcoming events, we expect that scrips can be used to manipulate the likelihood of linguistic expressions referring to these events. Previous research has shown that script-based discourse expectations modulate the likelihood of linguistic expressions, but script knowledge has often been operationalized with stimuli which were based on researchers’ intuitions and/or expensive production and norming studies. We propose to quantify the likelihood of an utterance based on the probability of the event to which it refers. This probability is calculated with event language models trained on a script knowledge corpus and modulated with probabilistic event chains extracted from the corpus. We use the DeScript corpus of script knowledge to obtain empirically founded estimates of the likelihood of an event to occur in context without having to resort to expensive pre-tests of the stimuli. We exemplify our method at a case study on the usage of nonsentential expressions (fragments), which shows that utterances that are predictable given script-based extralinguistic context are more likely to be reduced.