Script knowledge for natural language understanding
Saarland University, Saarbruecken, Germany, 2019.
While people process text, they make frequent use of information that is assumed to be common ground and left implicit in the text. One important type of such commonsense knowledge is script knowledge, which is the knowledge about the events and participants in everyday activities such as visiting a restaurant.
Due to its implicitness, it is hard for machines to exploit such script knowledge for natural language processing (NLP). This dissertation addresses the role of script knowledge in a central field of NLP, natural language understanding (NLU). In the first part of this thesis, we address script parsing. The idea of script parsing is to align event and participant mentions in a text with an underlying script representation. This makes it possible for a system to leverage script knowledge for downstream tasks. We develop the first script parsing model for events that can be trained on a large scale on crowdsourced script data. The model is implemented as a linear-chain conditional random field and trained on sequences of short event descriptions, implicitly exploiting the inherent event ordering information.
We show that this ordering information plays a crucial role for script parsing. Our model provides an important first step towards facilitating the use of script knowledge for NLU. In the second part of the thesis, we move our focus to an actual application in the area of NLU, i.e. machine comprehension. For the first time, we provide data sets for the systematic evaluation of the contribution of script knowledge for machine comprehension. We create MCScript, a corpus of narrations about everyday activities and questions on the texts. By collecting questions based on a scenario rather than a text, we aimed at creating challenging questions which require script knowledge for finding the correct answer. Based on the findings of a shared task carried out with the data set, which indicated that script knowledge is not relevant for good performance on our corpus, we revised the data collection process and created a second version of the data set.