Keeping reference in pre-trained language models and multimodal corpora

Abstract

Formal models of discourse structure and interpretation use a complex system of constraint verification to construct meaning representations. These representations describe how entities and events are introduced and how they subsequently evolve throughout the discourse. Deep learning models, in contrast, build representations that are more robust but less intelligible. The intuition that language models implicitly capture entity knowledge, and in turn benefit from it, has been explored for some time. In the first part of this talk, I will focus on the capacity of large pre-trained language models to track discourse entities, specifically their ability to predict whether referring expressions at different points in a text refer to the same entity. In the second part, I will present a pilot study on the annotation of multimodal corpora with anaphora information. The integration of images and text offers a unique opportunity to train models with grounded entity (i.e., reference) knowledge. However, I will argue that high-quality annotations are necessary before such training is undertaken.