The bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

Abstract

Surprisal estimates from Transformer-based large language models (LLMs) are often used to model expectation-based effects in human sentence processing, in which more predictable upcoming words are processed more easily. This talk presents a series of analyses showing that surprisal estimates from LLM variants that are larger and trained on more data are worse predictors of the processing difficulty that manifests in human reading times. First, regression analyses show a strong inverse correlation between model size and fit to reading times across three LLM families on two separate datasets. An error analysis reveals systematic deviations for the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words. Next, analyses of LLM variants that differ in the amount of training data show that surprisal estimates generally provide the best fit after about two billion training tokens and begin to diverge from the reading times with more training data. The adverse influence of model size also begins to emerge at this point and becomes stronger as training continues. Finally, based on recent findings on the scaling behavior of LLMs, word frequency is presented as a unifying explanation for these two effects. The theoretical implications of these results will be discussed.