Projects - SFB 1102

Area-A

Situational Context and World Knowledge

A1: Neurobehavioural Correlates of Surprisal in Online Comprehension

This project has used psycholinguistic experimentation and explicit computational modelling to establish that comprehension-centric surprisal (expectations driven by both knowledge about the world and linguistic experience) is reflected in the P600 component of the Event-Related Potential (ERP) signal. An overarching goal of the third phase is to generalize our previous findings to the processing of more naturalistic texts, and further to identify factors that may dynamically determine surprisal, such as depth of processing and domain knowledge. To address these questions, we will first establish the degree to which the P600 covaries with reading times, which we take as a behavioural index of comprehension-centric surprisal, during the continuous reading of naturalistic texts (WP1). The result of WP1 will essentially be a corpus of short texts, which includes the dependent neurophysiological and reading time measures for each word, as well as subject-specific comprehension and recall performance. This dataset will then be used as the basis for deriving a comprehensive taxonomy of predictors (WP2) for the processing indices recorded for each word. Crucially, this will allow us to assess how best to determine comprehension-centric surprisal in order to estimate P600 and RT measures on a word-by-word basis (e.g., in terms of a combination of bottom-up n-gram surprisal and top-down ‘world knowledge’-driven expectations; Venhuizen et al., 2019a), while also identifying and quantifying the influence of other predictors, such as association with the context, as well as behavioural performance. Based on the outcomes of WP1 and WP2, we will more directly examine how task and expertise not only modulate surprisal effects (WP3) but also differentially influence the depth of comprehension and subsequent recall.
Finally, we seek to refine the neurocomputational model developed in the previous phases (WP4) with a notion of depth of processing, and the consequence for surprisal, in order to account for findings from the experimental work packages.
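As a rough illustration of the bottom-up n-gram surprisal mentioned above, the following minimal sketch estimates word-by-word surprisal from bigram counts. The toy corpus, the function names and the add-one smoothing are all assumptions made for illustration; they are not the project's model, and the top-down, world-knowledge-driven component is not covered here.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_surprisal(prev, word, unigrams, bigrams):
    """Surprisal in bits, -log2 P(word | prev), with add-one smoothing (an assumption)."""
    vocab_size = len(unigrams)
    prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(prob)


# Hypothetical toy corpus, for illustration only.
corpus = "the cook prepared the meal and the guest ate the meal".split()
unigrams, bigrams = train_bigram(corpus)

# A continuation seen in the corpus versus one that was never seen.
expected = bigram_surprisal("the", "meal", unigrams, bigrams)
unexpected = bigram_surprisal("the", "ate", unigrams, bigrams)
print(f"the -> meal: {expected:.2f} bits; the -> ate: {unexpected:.2f} bits")
```

In a setting like the one described, per-word values of this kind would be one predictor among several, alongside measures of world-knowledge-driven expectation.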

Keywords: comprehension, scripts, schemas, events, discourse, surprisal, event-related potentials, eye-tracking

Principal Investigator

Matthew W. Crocker

A5: The Role of Language Experience and Surprisal for Learning and Memory

In the second phase of the CRC, we examined the role of language experience on predictive language processing, in particular (a) how prediction during language comprehension interacts with complex visual scenes in childhood, and how predictive processing influences the remembering and learning of new words (b) in middle childhood and (c) in younger and older adults.
In order to capitalize on these findings, we will continue to assess the role of language experience on predictive language processing and its influence on word learning and memory processes. In relation to the overarching theme for phase III, “information in flux”, we will investigate (in WP 1) how varying working-memory demands during language comprehension, in addition to surprisal, influence fast learning of novel word meanings (implicit memory processes) as well as retrieving previously encountered content from memory (explicit memory processes). In WP 2, we focus on the interaction between plausibility and surprisal during online language comprehension and their joint impact on subsequent memory retrieval. In particular, we will investigate false memories for predictable but not encountered words, and how false memories are cognitively represented. In nearly all WPs, we will adopt a lifespan approach by comparing children, as well as younger and older adults, to determine the impact of individual differences in language experience and working-memory capacity on surprisal. Whereas children, younger adults and older adults differ with respect to language experience, children and older adults differ from younger adults in their working-memory capacity. This approach will help us test whether traditional models of surprisal or more recent ones (in particular, the assumption of lossy memory representation) are better at accounting for empirical data.

Keywords: psycholinguistics

Principal Investigators

A6: Expectancy-based mechanisms during language comprehension and their relation to memory formation and retrieval

In Project A6 we use behavioural and electrophysiological measures to investigate how semantic surprisal in natural language contexts modulates learning and memory processes. Consistent with the pivotal role expectancy-based mechanisms play in online language comprehension, this project explores how these mechanisms are linked to memory formation and retrieval.

In three interrelated work packages, we want to take advantage of a predictive coding approach and explore how predictive processing shapes memory processes in different learning situations, i.e. declarative learning from prediction errors during the acquisition of an artificial grammar (WP 1), the acquisition of new knowledge and the role of intrinsic motivation (curiosity) (WP 2) and multilingual vocabulary learning (WP 3). In the final work package (WP 4), we plan to broaden our work on the mnemonic consequences of predictive processing from the level of single words to larger and more naturalistic texts by exploring the role of prediction errors for event segmentation in narrative reading. Taken together, we expect the combined outcome of these work packages to provide a detailed picture of the boundary conditions of how expectancy-based mechanisms during language processing shape memory processes and a comprehensive understanding of the neurocognitive mechanisms by which confirmed and disconfirmed predictions affect the formation and retrieval of new episodic memories.

Keywords: memory formation

Principal Investigators

A7: Controlling Information Density in Discourse Generation

The goal of A7 is to develop a natural language generation (NLG) system which generates building instructions in Minecraft. The NLG system implements a rational speaker model which trades off succinctness against clarity of the instructions. In particular, it rationally chooses the level of abstraction at which the parts of the construction are explained: for a human listener with sufficient domain knowledge, the instruction “build a railing on the other side” will be very efficient, but a listener who does not know what a railing looks like may need a more verbose block-by-block explanation to complete the construction successfully.
In the previous phase, we developed such an NLG system through an innovative combination of a hierarchical planner with a sentence generator. The hierarchical planner determines at which level of abstraction the individual building steps will be explained, whereas the sentence generator realizes the building instructions in natural language. These components are integrated tightly, in that the sentence generator supplies the cost function on which the planner relies. We showed in crowdsourcing evaluations that this integrated NLG system guides human users effectively in the construction of buildings in Minecraft, and that the choice of abstraction level significantly impacts task completion times and user satisfaction.
The main theme of A7 in the third phase will be “Minecraft in flux”: The NLG system will adapt to its user over the course of each construction, tailoring its language use to the user’s implicit preferences in order to optimize the clarity-succinctness tradeoff. It will recompute the instruction plan on the fly in response to an improved understanding of the user’s needs, through greatly accelerated neurosymbolic planning techniques. Finally, we will complement the NLG system with a “simulated user”, which follows building instructions in Minecraft instead of generating them. This will allow us to generate the necessary training data for the neurosymbolic planner, while at the same time providing a model of a listener in Minecraft to go along with our rational speaker.
Over the course of the second phase, we have also identified collaboration opportunities with other projects, for which Minecraft is an attractive domain for testing their hypotheses in a behavioural experimental setting. We will specifically dedicate project time in phase III to collaborating with B3 and C3 on ellipsis and referring expressions.

Keywords: discourse generation

Principal Investigators

A8: Adapting Text Generation to Individual Users

Project A8 is concerned with how to write a text in a way that a given reader can understand optimally. Our starting point is that readers differ in their cognitive properties, and that therefore, different choices of linguistic encoding will be optimal for different readers. We investigate this issue from two perspectives. In our psycholinguistic work packages, we will investigate how the interplay of cognitive properties and linguistic encoding affects comprehension, and how a specific reader’s cognitive properties can be inferred from their eye movements during reading. From a computational perspective, we will develop text-to-text generation systems which diagnose the reader’s cognitive properties and then rewrite a given text to be optimal for that reader.

WP 1 will be concerned with classifying readers with respect to cognitive properties, such as working memory capacity and lexical and linguistic knowledge, based on their behaviour during reading. WP 2 aims to relate these cognitive properties to differences in comprehension between readers, and to identify how language should be adapted to benefit an individual’s comprehension. In WP 3, we will develop methods for manipulating generated text in terms of lexical and syntactic complexity, while WP 4 will focus on strategies for automatically adding or removing information from a text in order to fit a reader’s background knowledge. Finally, in WP 5, the diagnostic results from WP 1 and their implications for comprehension, as determined by WP 2, will be used to automatically adapt text production using methods from WPs 3 and 4, while also taking into account uncertainty about user properties.

Keywords: text generation

Principal Investigators

Area-B

Discourse and Register

B1: Information Density in English Scientific Writing: A Diachronic Perspective

The overarching goal of B1 is to gain insights into the role of rational communicative concerns in diachronic language change. Specifically, we are interested in the emergence of sublanguages or registers, i.e. distinctive, fairly persistent functional varieties, focusing on scientific English and its development in the late modern period (1700–1900) up to recent times. We started with the overall hypothesis of communicative optimization, stating that scientific English developed an optimal code for expert-to-expert communication over time. Based on a comprehensive corpus compiled from the publications of the Royal Society of London, we applied selected types of computational language models (e.g. topic models, n-gram models, word embeddings) and combined them with information-based measures (e.g. entropy, surprisal) to capture diachronic variation. Across different models, we observe the same trend of overall decreasing entropy with temporary peaks of high entropy/surprisal (innovation) and a continuous re-assessment of existing linguistic options, manifested by discarding options, shifting options to other contexts of use (diversification), or giving strong preference to one option over alternative ones (conventionalization). The choice-constraining effects associated with diversification and conventionalization point to a general diachronic mechanism for maintaining communicative function, which is a major novel insight arising from our studies.
In the next project phase we intend to address the following research questions. (RQ 1) Are the linguistic patterns characterizing the diachronic development of scientific language similar across registers/genres or are they different? If similar, this would be evidence of a more general diachronic mechanism. (RQ 2) Within scientific language, what are additional, typical imprints of conventionalization? So far, we have focused on linguistic features of the field of discourse. In order to arrive at a fuller picture of register formation, we now shift our focus to linguistic units and items that encode tenor and mode of discourse, such as formulaic expressions expressing stance and markers of discourse relations. (RQ 3) How can we assess the overall communicative efficiency of scientific language? We argued before that scientific language developed an optimal code for expert communication with a general diachronic preference for compact structures such as noun phrases. Surprisal alone cannot explain the advantages of this trend. Instead, we suspect that more compact structures come with positive effects on (working) memory, such as information locality. Therefore, we plan to investigate the interplay between memory and surprisal.
Focusing on selected linguistic phenomena (multi-word expressions, discourse markers, nominal vs. verbal phrases), we complement a corpus-based, production-oriented approach with selected behavioural, comprehension-oriented studies.

Keywords: diachronic linguistics, scientific discourse, register variation, relative information density

Principal Investigators

B2: Cognitive Modelling of Information Density for Discourse Relations

A central goal of the third phase of project B2 is to investigate the marking of coherence relations cross-linguistically, and to extend the coverage of discourse relational resources to a larger variety of languages. This will include languages which have already received attention in discourse studies as well as more under-researched languages such as Nigerian Pidgin, which is in an interesting stage of language evolution. The project thus directly contributes to Focus Area ‘Language typology, multilinguality and language change’ of the third phase of the CRC.
Specifically, we plan to focus on (i) cross-linguistic differences in the processing of discourse connectives, and to what extent these differences may be driven by information-theoretic principles; (ii) how differences in linearization between languages (placing discourse coherence devices in different positions within the relational arguments) affect the distribution of information across the relational arguments; and (iii) differences in the degree of specificity and function of discourse markers, which may affect the amount of information conveyed by these markers, and in turn may affect their usage distributions.
In order to address these goals, we aim to annotate a cross-lingual corpus with discourse relation information (WP 1) using crowd-sourcing methods developed in earlier project phases. The project will combine corpus-based investigations with psycholinguistic experiments intended to specifically test for processing differences between speakers of different languages (WPs 3, 4 and 5). Furthermore, the project will contain a computational work package which provides automatic tools for mapping between annotations of different languages and transfer of information across languages, and which will develop discourse connective identifiers and relation classifiers for under-resourced languages (WP 2).

Keywords: psycholinguistics, computational modelling, discourse relations

Principal Investigator

Vera Demberg

B3: Information Theory and Ellipsis Redundancy

The overarching goal of project B3 is to investigate to what extent the usage of ellipsis is determined by information-theoretic concepts like surprisal and entropy. This perspective on ellipsis complements existing linguistic theories, since the latter only tackle the question under which syntactic and semantic conditions an ellipsis is, in principle, grammatically licensed. Focusing on discourse-initial fragments and constituent ellipsis in phases I and II, respectively, we found strong evidence that the speaker’s choice between alternative encodings is essentially guided by two pragmatic imperatives of rational communication: “Avoid peaks!” and “Avoid troughs!”. In phase III, we will shift our focus to the remaining major class of ellipses: coordination ellipsis. To tie in with phases I and II, we first aim at confirming our results on the avoidance of peaks and troughs also in the area of coordination ellipsis. At the same time, we aim at strengthening the audience design hypothesis, a crucial prerequisite for Uniform Information Density (UID), in more interactive settings. Furthermore, the dynamic nature of coordination opens a window into the way the processing of prosodic information affects the recipient’s common ground, changes her expectations, and modulates the predictability of ellipsis as a function of time. Especially with right node raising, we expect a trade-off between surprisal and memory here. Finally, B3 is a hub for bringing together the research conducted in the CRC at the interface between information structure and information theory.

Keywords: theoretical linguistics, psycholinguistics, ellipsis

Principal Investigators

B4: Modelling and Measuring Information Density

B4 in the third phase continues our long-term research effort of exploring and shifting the limits of what neural language models are capable of. While in phase I our focus was on modelling long-range dependencies, phase II concentrated on improving language models by leveraging additional modalities and providing a better understanding of their inner workings. Now, in the third phase, the main goal of B4 is to study the adaptation of language models, as well as how to use them for dynamically changing tasks and data distributions, which relates closely to the notion of information in flux. In addition, we will explore the resource usage of neural language models in order to study the relationship between memory-efficient models and adaptation. We will also explore novel ways to represent knowledge in language models, which we believe is tightly connected to making language models more adaptable and robust to distribution changes.
WP1 will research new language modelling techniques that can cope with temporally changing or drifting data. In WP1 we will evaluate purely in terms of perplexity. This will change in WP2, where we will explore what happens to language models fine-tuned on downstream tasks in a temporally changing setting, building on the models developed in WP1. In WP3 we will turn to the second main theme of B4 in phase III: memory efficiency. Memory in WP3 refers to the compute resources used by the language model. We will explore more parameter-efficient language models and study the relationship between efficiency, adaptation, and generalization. WP4 will serve as the conceptual backbone for all work packages, as we expect models that combine parametric with non-parametric representations to be more adaptable and to generalize better. Finally, in WP5 we study memory in language models from a different perspective: which parts of the history (that is, the words preceding the word for which the surprisal is calculated) are most relevant for surprisal calculation. WP5 will provide the modelling relevant for A5.
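Since WP1 is evaluated in terms of perplexity, a minimal sketch of how perplexity relates to per-token surprisal may be useful. It assumes that a language model has already assigned a probability to each token of some held-out text; the function name and the two probability lists are hypothetical and stand in for real model output.

```python
import math


def perplexity(token_probs):
    """Perplexity = 2 ** (mean surprisal in bits) over the evaluated tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_surprisal = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** mean_surprisal


# Hypothetical probabilities that one model assigns to the tokens of two texts:
# one from the period it was trained on, and one from a later, drifted period.
in_period = [0.25, 0.25, 0.25, 0.25]
drifted = [0.25, 0.125, 0.0625, 0.125]

print(perplexity(in_period), perplexity(drifted))
```

Under this definition, a model that copes well with temporal drift would show a smaller increase in perplexity when it is moved from in-period text to later text.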

Keywords: language modelling, long range dependencies, memory

Principal Investigator

Dietrich Klakow

B6: Unravelling Linguistic Knowledge via Multilingual Embedding Spaces and Latent Information

Embeddings (monolingual, multilingual, static, contextualized) are the workhorses of modern language technologies. They are based on the distributional hypothesis and can capture semantic, grammatical, morphological and other information. Most embeddings are now prediction-based (Mikolov et al., 2013). Embeddings can be at the word, sub-word, sentence, paragraph or document levels. They need (very) large amounts of data to be trained on. To date, most monolingual embeddings have been done for English and other such resource-rich languages, and similarly, multilingual embeddings usually involve English. Multilingual embeddings are especially promising (Devlin et al., 2019). Word translations are close in multilingual embedding spaces, sentence translations in sentence embedding spaces, and models allow fine-tuning and few- and zero-shot learning. Multilingual embeddings can be computed in a joint space or through alignment of monolingual spaces, and their final quality strongly depends on the languages and domains involved (Søgaard et al., 2018).
Mono- and multilingual embeddings constitute the core technology underpinning our previous work on translationese in B6 phase II. Our research in phase II showed for the first time that (i) departures from isomorphism between simple monolingual word embedding spaces computed from original and translated material allow us to detect translationese effects, to the extent that we can estimate phylogenetic trees between the source languages of the translations (Dutta Chowdhury et al., 2020, 2021); (ii) feature- and representation-learning approaches systematically outperform hand-crafted and linguistically inspired feature-engineering-based approaches on translationese classification (Pylypenko et al., 2021); and (iii) our feature- and representation-learning-based cross- and multilingual classification experiments provide empirical evidence of cross-language translationese universals (Pylypenko et al., 2021). A ranking of single hand-crafted features, based on the R² of linear classifiers trained to predict the output of the best-performing BERT model, shows that language-model-based average surprisal (perplexity) features account significantly for parts of the variance of the neural model.
For the new B6 proposal our research goals are foundational as well as practical: building on B6 phase II, we extend our research on multilinguality and translationese, addressing theoretical as well as practical questions about (i) information spreading in embedding spaces, (ii) capturing translationese subspaces and (iii) extracting latent background knowledge from bilingual data. We seek to apply answers to the foundational questions to improve NLP applications, including in particular NLP for low-resource languages, machine translation and perhaps even general multilingual technologies. From a foundational point of view, we focus on what is captured by multilingual embeddings: what patterns are manifest in embedding data? Can we detect patterns (clusters) with and without linguistic labels? How do they compare? Do clusters naturally emergent in embedding space (without linguistic labels) correspond to linguistic typology? Where and why do they differ across languages? How do we capture situations where isomorphism between embedding spaces does not and should not hold? Can we identify, compute and use translationese subspaces? How can we automatically capture and quantify latent background knowledge from translations? Answers to these questions may support applications: to achieve optimal results in general as well as for low-resource multilingual models, should we cluster languages that pattern in a similar way (as in e.g. “cardinality-based” MT)? Which applications benefit from clustering? Can clustering optimize few- or zero-shot learning? Can properties of multilingual embedding spaces lead to better lexicon induction for self- and unsupervised machine translation supporting low-resource scenarios? Can translationese subspaces improve machine translation? Can capture of latent cultural background knowledge from translation and general multilingual data reduce perplexity of language models?
Importantly, we will explore to what extent our findings can be modelled or explained in terms of the information-theoretic concepts that take centre stage in the CRC, including entropy and surprisal: can clustering in multilingual embedding spaces be usefully described in terms of entropy? To what extent do applications of translationese subspaces register in terms of increased or decreased surprisal in translation output? To what extent can latent background knowledge be used to gain an improved notion of surprisal for the results of a translation? Our proposal targets Focus Area (3) ‘Language typology, multilinguality, language change’ of phase III of the CRC.

Keywords: machine translation

Principal Investigators

B7: Translation as Rational Communication

Project B7 focuses on the specific linguistic properties of translation, i.e. non-randomly-occurring linguistic features that distinguish translations from original productions. Such properties are commonly referred to as “translationese” and emerge through the translation-inherent dilemma of ensuring source language fidelity while adhering to target language rules and norms. Our overarching research question is to what extent translationese effects can be described and explained in an information-theoretic framework of rational communication. Our approach is corpus-based using selected computational language models and information-theoretic measures including surprisal and entropy to assess translationese effects. Adopting an information-theoretic perspective on translationese is a novel idea and promises new insights into the general mechanisms underlying translation.
In the first project phase our focus was on comparable corpora, i.e. records of translations, both written and spoken (interpreting), and comparable (in-domain), original productions in the target language, which we built specifically from European Parliament data. We employed selected information-theoretic measures to compare models of different translation modes (interpreting vs. translation), expertise (professionals vs. learners) and languages (German, English, Spanish). Across all our studies, we found that interpreting overemphasizes features of oral, online production and translation overemphasizes features of planned, written discourse.
In the next project phase, we embark on explaining our descriptive findings on rational communication grounds. First, we plan to extend our modelling efforts with a memory component and analyse the interplay between memory and surprisal for an overall mechanistic explanation of translationese. Specifically, we expect that some translationese features are associated with the attempt to optimize working memory, especially in a high-pressure situation like simultaneous interpreting. We will investigate selected linguistic phenomena that we found to be involved in translationese/interpretese in more detail in terms of surprisal and the interplay of memory and surprisal, including vocabulary and syntactic biases, use of specific discourse markers (connectives, particles) and (co-)reference patterns as well as disfluencies (interpreting). Second, we gear linguistic analysis to the translation relation proper because this is where explanations of translationese effects must ultimately be sought, complementing results from comparable corpora with analysis of parallel corpora to micro-inspect possible triggers in the source language for specific target language choices, thus bringing in a contrastive-linguistic/typological perspective.

Keywords: human translation

Principal Investigators

Area-C

Variation in Linguistic Encoding

C1: Information Density and the Predictability of Phonetic Structure

C1 is concerned with the relation between information density and linguistic encoding in phonetics and human speech processing. After investigating effects on subword (segmental and syllable) levels, in the third phase C1 will explore how information-theoretic effects change dynamically, in adults and children, as interactive tasks and conversations unfold. We investigate the extent to which information structure and accommodative behaviour correlate with information-theoretic factors. We will incorporate the results from all three phases in order to provide a consolidated view of the effects of structure-based predictability on the phonetic details of spoken language.

Keywords: phonetics, speech processing, information structure, accommodation

Principal Investigators

C3: Rational Encoding and Decoding of Referring Expressions

C3 investigates information-theoretic explanations of encoding and decoding behaviour, with an emphasis on the mechanisms that underlie the linearization of referring expressions. Rational theories of communication assume that interlocutors optimize successful transmission of information by reasoning about their communicative partner, suggesting that the production and comprehension systems may be intertwined in order to enable speakers to be better understood, and to allow listeners to reason about the speaker’s intentions. In this project, we aim to investigate this interdependence of production and comprehension. Firstly, our experimental investigations seek to establish the extent to which rational theories explain linearization preferences across multiple referring expressions, by investigating whether the “given before new” maxim of information structure can be viewed as an instance of the “expected before unexpected” preference following from UID. Secondly, we examine whether there is evidence for lexically specific prediction effects in comprehension, as would be expected under the prediction-by-production hypothesis put forward by Pickering and Gambi (2018). Lastly, we examine which rational encoding strategies—e.g. UID, Maximal Informativity, and Information Status—best explain speaker behaviour with regard to linearization of multiple referential expressions in situated language use. Together, these studies examine how information in flux obtains when early encoding choices dynamically modulate the information density of the signal that follows. Our findings will directly inform the development of an integrated computational model of expectation-based comprehension and production processes, with the aim of explicitly characterizing how these mechanisms are combined to support the kind of reasoning about listener/speaker behaviour that is assumed by current rational accounts.

Keywords: comprehension, production, encoding, linear order, syntactic variation, surprisal, event-related potentials, eye-tracking

Principal Investigators

C4: Mutual Intelligibility and Surprisal in Slavic Intercomprehension (INCOMSLAV-3)

In the first two phases of the CRC, the empirical focus of C4 was on the mutual intelligibility of visual (written) or auditory (spoken) input for speakers of closely related languages in the Slavic language family. Experimental and modelling work in the second phase, which has combined methods from language, speech and translation technology, has provided a wealth of findings highlighting how information density is distributed across the acoustic and the text channels in successful intercomprehension. Based on these results, we are now in a position to address, in the third phase, core properties of intercomprehension as they unfold in goal-oriented communication, characterized by cooperative behaviour and adaptive interaction. This overarching goal entails the investigation of linguistic structures beyond lexical similarity and word-sequence-based predictability, taking into account constructional similarity, the cross-lingual transparency of multi-component units, and prosody. Specifically, conversational dialogue-style experimental setups are employed in order to explore the (ex)change of information as the interaction unfolds. We will develop models of surprisal capturing the information conveyed by multi-component units and prosodic features, in particular intonation. Finally, C4 will validate the scalability of our results and models in terms of a transfer to a selected set of features of other language families, e.g. Semitic.

Keywords: intercomprehension, Slavistics, cross-lingual surprisal

Principal Investigators

C6: Information Management as a Factor for Syntactic Variation in the History of German

The overarching goal of project C6 is an information-theoretic account of specific types of syntactic variation throughout the history of German. In the first project phase, we investigated the role that surprisal plays in the extraposition of constituents. In the next phase, we want to focus on the serialization of constituents relative to each other. In German, word order is only partially determined by grammatical factors such as syntactic function or case: it has been demonstrated abundantly in the literature that an important factor for serialization in German is information structure (e.g., Lenerz, 1977; Musan, 2002; Frey, 2004; Speyer, 2011, 2015a; Rauth, 2020). In information-structural research, constraints such as “given before new information” (Lenerz, 1977; Musan, 2002; Frey, 2004) have been identified as highly relevant for German word order. In particular, this concerns the serialization in the so-called middle field, i.e., the part of the clause between the finite verb or the conjunction (in main or subordinate clauses, respectively) and the remaining verbal elements. Such constraints also had an impact on word order in earlier stages of German (e.g., Speyer, 2011, 2015a; Rauth, 2020), although it is not clear whether their impact decreased or increased over time.
We assume that information-structural notions such as information status are correlated with surprisal, or at least concomitant with it (see, e.g., Speyer and Lemke, 2017; Speyer and Voigtmann, 2021). For example, an explanation for the observed constraint “given > new” could be that the new information conveyed later in the clause is made more predictable by the given information conveyed earlier in the clause, in that the given information sets up expectations as to what the new information could be. Thus, the surprisal of the new information would be lowered, which could be a strategy to smooth out the information profile of clauses in accordance with the Uniform Information Density hypothesis (UID; Levy and Jaeger, 2007).
We will test this assumption with modern and historical corpus data to determine whether changes in word and constituent order are related to, and can be explained by, information-theoretic concepts like UID. For this, we first transfer existing automatic methods for annotating syntactic categories and information structure to historical data to create a database of authentic text samples. Next, we generate information profiles and calculate surprisal curves for these samples and compare them to the profiles and curves of a variant corpus. A variant corpus (as proposed in the current phase) is an artificial “parallel” corpus that we generate from the authentic samples by changing specific linguistic parameters according to our hypothesis while other parameters are kept constant. This allows us to quantify the effects of the information profile and other information-theoretic measures on the syntactic changes in question.
Besides the corpus analysis, we will validate our corpus findings with experiments on ‘living’ German by exploring whether the influence of information density on constituent order that we expect to find can be correlated with processing difficulties. In that respect, we will cooperate with projects that have an experimental approach (as detailed in the work packages).
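The comparison of surprisal curves between an authentic sample and its variant can be illustrated with a minimal sketch. It assumes a toy add-one-smoothed bigram model and uses the variance of per-word surprisal as a rough stand-in for the smoothness of an information profile; the function names, the training sentences and the smoothness measure are all illustrative assumptions, not the language models or measures used in the project.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect context counts, bigram counts and the vocabulary (toy model)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + list(sent)
        vocab.update(sent)
        contexts.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def surprisal_profile(sentence, contexts, bigrams, vocab):
    """Per-word surprisal in bits: -log2 P(word | previous word), add-one smoothed."""
    tokens = ["<s>"] + list(sentence)
    profile = []
    for prev, word in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
        profile.append(-math.log2(prob))
    return profile


def uid_variance(profile):
    """Variance of the surprisal values; lower means a flatter profile."""
    mean = sum(profile) / len(profile)
    return sum((s - mean) ** 2 for s in profile) / len(profile)


# Hypothetical training data and a hypothetical authentic/variant pair.
corpus = [
    "the teacher gave the student a book".split(),
    "the teacher gave a student the book".split(),
    "the student read the book".split(),
]
contexts, bigrams, vocab = train_bigram(corpus)

authentic = "the teacher gave the student a book".split()
variant = "the teacher gave a book the student".split()   # constituents reordered

for label, sent in [("authentic", authentic), ("variant", variant)]:
    profile = surprisal_profile(sent, contexts, bigrams, vocab)
    print(label, [round(s, 2) for s in profile], round(uid_variance(profile), 3))
```

In a realistic setting the bigram model would be replaced by a language model trained on period-appropriate data, and the variant sentences would be generated systematically rather than by hand.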

Keywords: syntactic variation

Principal Investigators

C7: Cross-linguistic Information-Theoretic Modelling of Communicative Efficiency

Information-theoretic modelling of cross-linguistic data has uncovered general principles of language optimization, among them dependency length minimization and dependency locality (Futrell et al. 2015, 2020) and their link to efficient memory use (Ferrer i Cancho 2015; Hahn et al. 2021). While existing studies usually provide statements about the overall difference between languages, they do not inspect in detail which language-specific structures license them. In this project, we aim to replicate some of this work using a more comprehensive dataset and build models to explain the original findings further, from an explicitly cross-linguistic perspective, sampling 33 Indo-European languages and 10 diverse non-Indo-European languages. We believe that, as tentatively suggested by the accounts cited above and by earlier work in linguistic typology (Givón 1988; Gundel 1988; Herring 1990; Payne 1990; Skopeteas and Fanselow 2010), the major factor missing in these accounts is the impact of information structure on word order variability: most languages of the world allow word order changes to encode specific changes in the given/new/contrastive status of constituents (Gundel 1988). While word order is partly taken into account in a general sense in cross-linguistic information-theoretic studies, very little attention has been paid to the impact of word order variability on the fit of these models. Languages differ widely regarding the amount of word order variability that they allow, which word orders are ‘rigid’ and which ones are ‘free’, and how much of that variability is dependent on information structure. Hence, there are large cross-linguistic differences regarding the word order–information structure interface that have not explicitly informed information-theoretic modelling thus far. This project aims to remedy that.
Our main aim is to incorporate information status in information-theoretic modelling of language use in an explicit and cross-linguistic fashion in order to investigate communicative efficiency in terms of locality and the memory-surprisal tradeoff, building on Futrell et al. (2015) and Hahn et al. (2021). Secondly, given known cross-linguistic variation of the word order–information structure interface, we also determine how much of the fit of these information-theoretic models is dependent on word order variability. We consider which word orders (amongst others, the word order of nominal heads and different types of modifiers; arguments and their verbal head; clauses and their heads) are variable across the sampled languages, and how word order variability interacts with minimization of dependencies, both from a cross-linguistic and from a language-internal perspective. Our third goal is to study the relation between information-theoretic concepts (surprisal/information density) and information status; these are conceptually related, but the relation is empirically under-studied. Lastly, we aim to contribute to and enrich the long-standing discussion of communicative efficiency in typology with a conceptual framework combining information status and information theory, focusing on the interaction between (overt) morphological marking of syntactic arguments and word order (variability), and additionally including a diachronic perspective.
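The interaction between word order variability and dependency minimization can be made concrete with a small sketch. It assumes hand-annotated toy dependency trees in which each word carries the 1-based position of its head (0 for the root), and it measures dependency length as the linear distance between a word and its head; the example sentences and their analyses are illustrative assumptions, not project data.

```python
def total_dependency_length(heads):
    """Sum of linear distances between each non-root word and its head.

    heads[i] is the 1-based position of the head of word i + 1; 0 marks the root.
    """
    return sum(abs((i + 1) - head) for i, head in enumerate(heads) if head != 0)


# Two orders of the same toy sentence, with assumed head annotations.
order_a = [("John", 2), ("threw", 0), ("out", 2), ("the", 5), ("trash", 2)]
order_b = [("John", 2), ("threw", 0), ("the", 4), ("trash", 2), ("out", 2)]

for label, sentence in [("particle first", order_a), ("object first", order_b)]:
    heads = [head for _, head in sentence]
    words = " ".join(word for word, _ in sentence)
    print(f"{label}: '{words}' -> total dependency length {total_dependency_length(heads)}")
```

With these annotations the particle-first order has a total length of 6 and the object-first order a total length of 7, so the two orders differ in locality even though they contain the same dependencies.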

Keywords: language typology

Principal Investigator

Annemarie Verkerk

Transfer Projects

T1: Information density and linguistic encoding in “Leichte Sprache” (IDeaLite)

This transfer project applies insights gained from the study of the way information is linguistically encoded in CRC 1102 “Information Density and Linguistic Encoding” to the analysis and evaluation of texts in LEICHTE SPRACHE (Easy German), which is an umbrella term for different forms of regulated German that have been created to make written information accessible for low-literacy readers (Netzwerk Leichte Sprache, 2014; Bredel and Maaß, 2016; Bock, 2018b). LEICHTE SPRACHE is high on the political agenda and a core measure in creating equal opportunities. Providing information in Easy German alongside Standard German is requested of public institutions (cf. Bundesteilhabegesetz and Nationaler Aktionsplan 2.0) and is increasingly also provided by organizations in the social sector.
Specifically, we pursue the following goals: (1) to show that the models of information density we have developed in the course of CRC 1102 are suitable to assess texts in terms of fit for specific user groups (here, people with low literacy due to learning difficulties); (2) to extend existing surprisal-based models by integrating additional linguistic phenomena relevant in the context of Easy German (especially text cohesion and coherence); and (3) to engage with users and providers of Easy German, eliciting their expert insights and judgment and, in turn, giving them linguistically informed input and practical advice on user group–tailored text production. In terms of methods, we pursue a corpus-based approach combined with selected experiments. To explore Easy German, we compare corpora of Standard Language and Easy German in selected domains (e.g. newspaper text, institutional websites) and apply selected information-theoretic measures, such as surprisal of words and syntactic patterns, the entropy over lexical usage (cf. previous work by project B1/Teich) and the information density of sentences (cf. previous work by B3/Reich). Judgment tasks of text variants independently assumed to be “easy” vs. “standard” and simple completion tasks that involve users of Easy German are carried out to obtain complementary evidence from individuals in real-time interaction.
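One of the measures named above, the entropy over lexical usage, can be sketched as the Shannon entropy of a corpus's unigram distribution. The snippet below is a minimal illustration under that assumption; the function name and the two invented mini-corpora (given in English for readability) are hypothetical and do not reproduce the measures or data of projects B1 or B3.

```python
import math
from collections import Counter


def lexical_entropy(tokens):
    """Shannon entropy (in bits) of the relative word frequencies in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Invented toy samples standing in for a standard and an easy-language text.
standard_text = "applicants must submit the required documentation before the deadline".split()
easy_text = "you must send the papers . you must send the papers on time .".split()

print("standard:", round(lexical_entropy(standard_text), 3), "bits")
print("easy:", round(lexical_entropy(easy_text), 3), "bits")
```

Entropy estimates of this kind depend on sample size, so a real comparison would need corpora of comparable size or a suitable normalization.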
The research relies on close collaboration with external partners: Arbeiterwohlfahrt (AWO) Saarland and Netzwerk Leichte Sprache with AWO Dillingen/Saarlouis as our official application partner, as well as Heike Zinsmeister as an external academic PL who brings in specific expertise on corpus-based analysis of Easy German and will liaise with Netzwerk Leichte Sprache.

Keywords: easy German

Principal Investigators

Completed Projects

A2: Script Knowledge for Modelling Semantic Expectation

Project A2 is concerned with the development of wide-coverage, automatic methods for acquiring script knowledge, thus addressing the absence of such script knowledge bases. Since script event sequences are rarely explicit in natural prose, the project will use crowd-sourcing methods to create suitable corpora for script acquisition. These will then serve as the input for novel script-mining algorithms which will be used to induce psychologically plausible probabilistic script-automata representations.

Finally, distributional models will be applied to determine the semantic similarity of linguistic expressions, as conditioned by script knowledge – methods that are essential for applying scripts to real texts. The script resources created in this project will inform the development of experimental stimuli in A1, and will be used and evaluated directly in the models developed in A3.

Keywords: computational linguistics, crowdsourcing, script knowledge, world knowledge, script mining, distributional models

Principal Investigators

A3: Modelling the Information Density of Event Sequences in Texts

Project A3 aims at collecting formalized knowledge about prototypical sequences of events – script knowledge – from data, and using it to improve algorithms for natural language processing and our understanding of linguistic encoding choice and interpretation in human communication. The project will develop methods for learning scripts with wide coverage from unannotated texts and extend the representations of script events with information about their preconditions and effects to keep track of causal connections between events.

These deeper and wider-coverage script models will be applied to various natural language processing tasks, and used to model pragmatic interpretations; we will use and extend the Rational Speech Act (RSA) model as a framework for modelling pragmatic inferences and explore how the RSA model can be related to existing notions used in the SFB, specifically the UID hypothesis.
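For orientation, the basic RSA recursion (literal listener, pragmatic speaker, pragmatic listener) can be sketched on a toy reference game. The sketch assumes a uniform prior over objects, zero utterance cost and a rationality parameter alpha of 1; the objects, utterances and function names are illustrative assumptions and not the extended model developed in the project.

```python
OBJECTS = ["blue_square", "blue_circle", "green_square"]
UTTERANCES = ["blue", "green", "square", "circle"]


def normalize(scores):
    """Turn non-negative scores into a probability distribution."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def is_true(utterance, obj):
    """Literal semantics: an utterance is true of an object that has that feature."""
    return utterance in obj.split("_")


def literal_listener(utterance):
    """L0(object | utterance): uniform over objects the utterance is true of."""
    return normalize({o: 1.0 if is_true(utterance, o) else 0.0 for o in OBJECTS})


def pragmatic_speaker(obj, alpha=1.0):
    """S1(utterance | object), proportional to L0(object | utterance) ** alpha."""
    return normalize({u: literal_listener(u)[obj] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance, alpha=1.0):
    """L1(object | utterance), proportional to S1(utterance | object) under a uniform prior."""
    return normalize({o: pragmatic_speaker(o, alpha)[utterance] for o in OBJECTS})


for obj, prob in pragmatic_listener("blue").items():
    print(obj, round(prob, 3))
```

In this toy game the pragmatic listener who hears "blue" favours the blue square (0.6) over the blue circle (0.4), because a speaker referring to the circle would more likely have said "circle".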

Keywords: psycholinguistics, computational linguistics, crowdsourcing, script knowledge, world knowledge, cognitive modeling, predictability

Principal Investigators

A4: Language Comprehension in a Noisy Channel

The central goal of project A4 is to examine how noise (or the effect of reduced hearing ability) will influence language comprehension, and how natural language generation systems can adapt their output to minimize the risk of misunderstanding. The experimental part of the project investigates neurophysiological correlates of bottom-up perceptual level and top-down predictive language processing, and how these functions interact when noise is added to the signal.

In the modelling part, we propose a noisy channel model, consisting of a component that models comprehension at different levels of hearing ability (based on insights from the experimental part of the project), and a generation component that optimizes the system-generated output in order to minimize the risk of misunderstanding, while also adapting the output to a target channel capacity.

Keywords: ID and channel capacity, language comprehension, aging, dual tasking, driving simulation, natural language generation, dialog systems, psycholinguistics

Principal Investigators

B5: The Extraction of Complex Information and Encoding Density (EXCITED)

Project B5 addresses the problem of automatic relation extraction from densely encoded texts. Focusing on cross-sentence relation extraction, the project takes into account different linguistic encodings of a given relation. The research is corpus-based, compiling a collection of syntactically analysed mentions of selected relations exhibiting variation in encoding density.

The insights gained from analysis will be used to optimize automatic relation extraction taking into account encoding density and its relation to information density, and further shed light on whether the observed variation can be explained by uniform information density (UID).

Keywords: computational linguistics, relation extraction, distant supervision

Principal Investigator

Hans Uszkoreit

C5: Information Density Aware Text-to-Speech Synthesis

Project C5 investigates how text-to-speech (TTS) synthesis techniques can be enhanced to take knowledge about information and encoding density into account. The project explores methods to connect and align the processing of high-level information with its encoding into low-level phonetic parameters in TTS synthesis. The approach is to encode information density in two stages: first, directly as high-level parameters during TTS voice building (offline) and, second, during runtime synthesis (online).

Quantification of information density can also be used to develop a model of listeners’ susceptibility to synthesis artifacts, in order to automatically predict and pre-emptively improve the perceived output quality by selecting a sequence of acoustic units that forms the desired variation and density of encoding given a defined degree of information density.

Keywords: text-to-speech synthesis, voicebuilding, acoustic correlates of information density

Principal Investigator

Ingmar Steiner
