The Role of Joint Embodiment in Situated Language-Based Interactions

Abstract

Large-scale pretraining has become the standard approach to automated reasoning over text, visual perception, or both. But how far does this approach take us toward systems that generalize to language use in realistic, multi-agent situated interactions? First, I will review existing work on evaluating the spatial and compositional reasoning capabilities of current multimodal LMs. Then, I will turn to how these benchmarks miss a key aspect of real-world situated interactions: joint embodiment. I will discuss how joint embodiment in a shared world supports perspective-taking, an often-overlooked aspect of situated reasoning, and introduce a new environment and benchmark for studying the influence of perspective-taking on language use in interaction.