To illuminate the neurophysiological basis of surprisal, Project A1 will extend a neurocomputational model, developed during the first phase of the project, so as to explicitly link surprisal to ERP correlates of comprehension.
Further, we will empirically test three predictions that follow from the model: 1) that the P600 is an index of interpretation-level surprisal, 2) that surprisal reflects the generation of forward and backward inferences, and 3) that surprisal reflects the interaction of world knowledge and linguistic experience.
Keywords: comprehension, scripts, schemas, events, discourse, surprisal, event-related potentials, eye-tracking
Project A3 aims at collecting formalized knowledge about prototypical sequences of events – script knowledge – from data, and using it to improve algorithms for natural language processing and our understanding of linguistic encoding choice and interpretation in human communication. The project will develop methods for learning scripts with wide coverage from unannotated texts and extend the representations of script events with information about their preconditions and effects to keep track of causal connections between events.
These deeper and wider-coverage script models will be applied to various natural language processing tasks, and used to model pragmatic interpretations; we will use and extend the Rational Speech Act (RSA) model as a framework for modelling pragmatic inferences and explore how the RSA model can be related to existing notions used in the SFB, specifically the UID hypothesis.
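As a point of orientation, the basic RSA recursion (literal listener, pragmatic speaker, pragmatic listener) can be sketched in a few lines. This is a minimal illustration of the standard formulation without utterance costs, not the project's model: the function names, the two-utterance "some"/"all" lexicon, and the uniform prior are assumptions chosen for the example.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def rsa(lexicon, prior, alpha=1.0):
    """One round of RSA reasoning without utterance costs.

    lexicon[u][m] is 1 if utterance u is literally true of meaning m, else 0.
    prior[m] is the prior probability of meaning m.
    Returns (l0, s1, l1) with l0[u][m], s1[m][u] and l1[u][m].
    """
    n_utt, n_mean = len(lexicon), len(prior)
    # Literal listener: P_L0(m | u) is proportional to truth(u, m) * P(m).
    l0 = [normalize([lexicon[u][m] * prior[m] for m in range(n_mean)])
          for u in range(n_utt)]
    # Pragmatic speaker: P_S1(u | m) is proportional to P_L0(m | u) ** alpha.
    s1 = [normalize([l0[u][m] ** alpha for u in range(n_utt)])
          for m in range(n_mean)]
    # Pragmatic listener: P_L1(m | u) is proportional to P_S1(u | m) * P(m).
    l1 = [normalize([s1[m][u] * prior[m] for m in range(n_mean)])
          for u in range(n_utt)]
    return l0, s1, l1

# Illustrative lexicon: utterances "some", "all"; meanings "some but not all", "all".
lexicon = [[1, 1],   # "some" is literally true of both meanings
           [0, 1]]   # "all" is true only of the "all" meaning
l0, s1, l1 = rsa(lexicon, prior=[0.5, 0.5])
print("P_L1(some but not all | 'some') =", round(l1[0][0], 3))
```

In this toy setting the pragmatic listener assigns about 0.75 to the "some but not all" reading of "some", compared with 0.5 for the literal listener, which is the familiar scalar implicature.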
Keywords: psycholinguistics, computational linguistics, crowdsourcing, script knowledge, world knowledge, cognitive modeling, predictability
The central goal of project A4 is to examine how noise (or the effect of reduced hearing ability) will influence language comprehension, and how natural language generation systems can adapt their output to minimize the risk of misunderstanding. The experimental part of the project investigates neurophysiological correlates of bottom-up perceptual level and top-down predictive language processing, and how these functions interact when noise is added to the signal.
In the modelling part, we propose a noisy channel model, consisting of a component that models comprehension at different levels of hearing ability (based on insights from the experimental part of the project), and a generation component that optimizes the system-generated output in order to minimize the risk of misunderstanding, while also adapting the output to a target channel capacity.
Keywords: ID and channel capacity, language comprehension, aging, dual tasking, driving simulation, natural language generation, dialog systems, psycholinguistics
Project A5 examines how linguistic and visual context jointly shape prediction, and how this interplay depends on child development and on individual differences in language and intellectual abilities.
Specifically, we will investigate whether children make more simplified predictions than adults; whether the frequency of prediction errors influences surprisal; and whether prediction (error) affects acquisition and storage of novel word meanings across different age groups. Understanding the relationship between prediction, error, and learning will shed light on the neurocognitive basis of surprisal and deepen our understanding of its psychological reality.
Project A6 explores how different aspects of semantic surprisal modulate the formation and retrieval of episodic memories as indexed by behavioral and ERP measures. In a first step, we will operationalize semantic surprisal by means of the expectedness, unexpectedness, or incongruence of a target word in a given sentence context and investigate how this modulates the encoding of memories for these words and the respective ERP measures.
Next, it will be explored whether semantic surprisal not only modulates memory encoding but also affects memory retrieval processes, and whether these effects can be generalized to implicit memory processes. Building on these findings, it will be investigated whether semantic surprisal set by new fictional knowledge supports the formation of new memories in a similar way as previously established world knowledge does. Finally, it will be explored how contextual factors influence the effects of semantic surprisal on memory formation and retrieval.
Keywords: memory formation
Project A7 aims to develop a system which generates effective technical instruction videos for the computer game “Minecraft”. This system will implement a rational speaker who trades off communicative success (being understood by the listener) against succinctness, in particular with respect to the listener’s domain knowledge and the division of labor between language and video.
To find near-optimal communicative choices efficiently enough for interactive use, the project will advance heuristic search methods within and across the generation of individual sentences.
Keywords: discourse generation
The project investigates the diachronic development of written scientific English, focusing on information density. On the basis of relevant data sets (e.g. the Royal Society Corpus), computational language models are built for calculating information density/surprisal over different linguistic units (morphemes, words, syntactic phrases/constructions).
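To make the notion concrete: the surprisal of a unit is its negative log probability in context, measured in bits when the logarithm is base 2. The sketch below estimates word-level surprisal from an add-one-smoothed bigram model. It is only an illustration under simplifying assumptions: the toy corpus, the function names, and the choice of a bigram model with add-one smoothing are hypothetical and say nothing about the language models actually used in the project.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Collect the counts needed for an add-one-smoothed bigram model."""
    contexts = Counter(tokens[:-1])             # how often each word occurs as a left context
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each adjacent word pair occurs
    vocab_size = len(set(tokens))
    return contexts, bigrams, vocab_size

def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return -math.log2(prob)

# Hypothetical toy corpus, standing in for a real diachronic corpus.
corpus = "the air is heavy the air is light the water is heavy".split()
model = train_bigram(corpus)
for prev, word in [("the", "air"), ("the", "water")]:
    print(f"{prev} -> {word}: {surprisal(prev, word, model):.2f} bits")
```

In the toy corpus "air" follows "the" more often than "water" does, so it receives the lower surprisal. The same computation carries over to morphemes or constructions once the units and their contexts are defined.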
Selected phenomena of diachronic variation are investigated w.r.t. the role of information density along with other factors potentially involved in usage change. Both syntagmatic conditions and paradigmatic effects of change are studied.
Keywords: diachronic linguistics, scientific discourse, register variation, relative information density
Project B2 investigates rational models of language processing at the level of coherence relations. A central aim of the project is to jointly model the likelihood of communicative success (conveying the intended relation) and the linguistic encoding of a discourse relation in terms of connective choice (or omission of an explicit connective).
To this end, we explore the relationship between the uniform information density hypothesis and pragmatic rational models of communication, such as the rational speech act model (RSA). A modeling bottleneck lies in the small amount of training data for building automatic discourse relation parsers that can estimate discourse relation surprisal. This will be addressed using crowd-sourced annotations, as well as machine learning methods that enable us to exploit additional weaker signals from related tasks and explicitation of coherence relations during human translation.
Keywords: psycholinguistics, computational modelling, discourse relations
Project B3 investigates whether in a given utterance situation (high) redundancy in terms of (low) surprisal influences the speaker’s decision to elide (above and beyond grammatical needs). To this end, B3 systematically contrasts elliptical and non-elliptical utterances in predictable as well as in non-predictable conditions in order to collect experimental measures of comprehension and production behaviour that can be linked to cognitive effort and, thus, surprisal.
The project is divided into three experimental work packages, which vary the following three factors: (i) predictive properties of the discourse that precedes both antecedent and target, (ii) predictive morphosyntactic properties of the linguistic antecedent (matches vs. mismatches), and (iii) properties of the target (ellipsis vs. full form). The general hypothesis pursued in this project is that (all other things being equal) speakers show an increased preference for reduced structures the more predictable (the less surprising) the target area is, given the antecedent and/or the salient preceding discourse.
Keywords: theoretical linguistics, psycholinguistics, ellipsis
Classical language models predict a word given a sequence of predecessor words. We will extend this to condition on knowledge from the environment, that is, to condition not only on the linguistic context but also on context from the real world. In one branch of the project, we will consider language models that also condition on an image.
Knowledge of the image in whose context the text was produced should help to predict the next word. In a second branch of the project, we will consider knowledge bases, question-answer data sets and states of a game as additional context. The surprisal and the predictability of an utterance like “Pawn from E2 to E4” depend on the present state of a chess game.
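The chess example can be illustrated with a toy calculation of surprisal conditioned on a non-linguistic state. Everything in this sketch is an assumption made for illustration: the two game states, the probabilities assigned to each utterance, and the function name are invented, and a real model would estimate such distributions from data.

```python
import math

# Made-up probabilities P(utterance | game state), for illustration only.
UTTERANCE_PROBS = {
    "initial position": {
        "Pawn from E2 to E4": 0.5,
        "Pawn from D2 to D4": 0.25,
        "Knight from G1 to F3": 0.25,
    },
    "E2 pawn already moved": {
        "Pawn from E2 to E4": 0.0,  # no pawn on E2, so the move is impossible
        "Knight from G1 to F3": 0.5,
        "Bishop from F1 to C4": 0.5,
    },
}

def contextual_surprisal(utterance, state):
    """Surprisal in bits of an utterance given the game state."""
    prob = UTTERANCE_PROBS[state].get(utterance, 0.0)
    # An utterance with zero probability in this state is infinitely surprising.
    return math.inf if prob == 0.0 else -math.log2(prob)

for state in UTTERANCE_PROBS:
    bits = contextual_surprisal("Pawn from E2 to E4", state)
    print(f"{state}: {bits} bits")
```

The same utterance is cheap in one state and maximally surprising in the other, which is the dependence on real-world context that a purely text-based language model cannot capture.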
Keywords: language modelling, long range dependencies, memory
The continuation of project B6 addresses limitations of the information density framework that was implemented and evaluated during the first phase of the project and that relies on hand-crafted features. To this end, it uses neural network approaches to capture and explore information density-based textual features, namely surprisal, with applications to translationese identification and to machine translation evaluation and improvement.
A systematic comparison will be conducted between neural and standard count-based textual features with the objective of exploring the information encoded in continuous representations obtained by unsupervised and end-to-end learning methods. The combination of hand-crafted and neural information density features will provide an extension to the classic surprisal measure. In addition, neural approaches facilitate multi-granularity input representations and various context sizes for surprisal measure calculation. An important part of project B6 in the context of the CRC is the analysis and visualisation of representations learned by neural networks, in order to compare with features inspired by our linguistic intuitions.
Keywords: machine translation
Human translation is modelled on the basis of a noisy channel, as commonly done in machine translation. The two main objectives of translation, source language fidelity and target language conformity, are modelled probabilistically.
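For reference, the noisy channel formulation familiar from statistical machine translation selects, for a source text $s$, the target text $\hat{t}$ as follows. Reading the channel term as source language fidelity and the prior term as target language conformity is our interpretation of the two objectives named above, not a statement of the project's exact model.

```latex
\hat{t} \;=\; \arg\max_{t} P(t \mid s)
        \;=\; \arg\max_{t} \frac{P(s \mid t)\, P(t)}{P(s)}
        \;=\; \arg\max_{t} \underbrace{P(s \mid t)}_{\text{fidelity (assumed reading)}} \;
                           \underbrace{P(t)}_{\text{conformity (assumed reading)}}
```

The denominator $P(s)$ can be dropped because it does not depend on $t$.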
Different modes (interpreting, translation) and levels of expertise (learner, professional) are considered. The data set consists of translations of speeches from the EU Parliament, compiled into a corpus. Computational translation models are built, which provide the basis for several studies on translationese, translation adequacy, and translation complexity.
Keywords: human translation
Project C1 addresses the relation between information density and linguistic encoding in phonetics and human speech processing. In the second funding period, an elaborate account of the prosodic hierarchy and its interaction with the ID profile of utterances will be implemented.
The project will also investigate effects of channel characteristics and audience design on production and perception. With respect to methodology, the project will develop a procedure for evaluating the contribution of phonetic features to the informativity of linguistic units and investigate the combination of language models across different linguistic levels.
Keywords: phonetics, speech processing, syllable structure, collocations
Project C3 investigates information-theoretic explanations of the encoding choices and comprehension difficulty in complex constructions.
We focus on visually situated comprehension of referring expressions, which allow us to examine key information-theoretic aspects of syntactic encoding in order to evaluate the predictions of both bounded (e.g., UID) and pragmatic (e.g., RSA) accounts of rational communication. We will test these predictions using empirical methods as well as neurocomputational modeling.
Keywords: comprehension, production, encoding, linear order, syntactic variation, surprisal, event-related potentials, eye-tracking
Project C4 investigates the relation between information density, encoding density and grammaticalisation in a cross-linguistic perspective, focusing on intercomprehension within the family of Slavic languages.
In the second funding period, the research agenda is extended to spoken language, which allows us to investigate how information density is balanced between the acoustic and the text level in successful intercomprehension. At all levels, from the acoustic signal and its phonetic structure to the texts generated from speech, we develop similarity metrics and information density measures related to Slavic intercomprehension.
Keywords: intercomprehension, Slavistics, cross-lingual surprisal
Project C6 investigates whether information-related factors have an impact on syntactic variation, and if so, how the impact can be modelled. Specifically, it examines extraposition of nominal and prepositional phrases and relative clauses.
Moreover, the project will adopt a diachronic perspective, and try to assess the relevance of information-related factors for the diachronic development of extraposition, and to shed light on the role of the discourse mode (oral vs. written) in this development. The project will create synchronic and diachronic corpora, and analyse them qualitatively and quantitatively.
Keywords: syntactic variation
Project A2 is concerned with the development of wide-coverage, automatic methods for acquiring script knowledge, thus addressing the absence of such script knowledge bases. Since script event sequences are rarely explicit in natural prose, the project will use crowd-sourcing methods to create suitable corpora for script acquisition. These will then serve as the input for novel script-mining algorithms which will be used to induce psychologically plausible probabilistic script-automata representations.
Finally, distributional models will be applied to determine the semantic similarity of linguistic expressions, as conditioned by script knowledge – methods that are essential for applying scripts to real texts. The script resources created in this project will inform the development of experimental stimuli in A1, and will be used and evaluated directly in the models developed in A3.
Keywords: computational linguistics, crowdsourcing, script knowledge, world knowledge, script mining, distributional models
Project B5 addresses the problem of automatic relation extraction from densely encoded texts. Focusing on cross-sentence relation extraction, the project takes into account different linguistic encodings of a given relation. The research is corpus-based, compiling a collection of syntactically analysed mentions of selected relations exhibiting variation in encoding density.
The insights gained from analysis will be used to optimize automatic relation extraction taking into account encoding density and its relation to information density, and further shed light on whether the observed variation can be explained by uniform information density (UID).
Keywords: computational linguistics, relation extraction, distant supervision
Project C5 investigates how text-to-speech (TTS) synthesis techniques can be enhanced to take knowledge about information and encoding density into account. The project explores methods to connect and align the processing of high-level information with its encoding into low-level phonetic parameters in TTS synthesis. The approach is to encode information density in two stages: first, directly as high-level parameters during TTS voice building (offline) and, second, during runtime synthesis (online).
Quantification of information density can also be used to develop a model of listeners’ susceptibility to synthesis artifacts, in order to automatically predict and pre-emptively improve the perceived output quality by selecting a sequence of acoustic units that forms the desired variation and density of encoding given a defined degree of information density.
Keywords: text-to-speech synthesis, voicebuilding, acoustic correlates of information density