Publications

Khamis, Ashraf; Degaetano-Ortlieb, Stefania; Kermes, Hannah; Knappen, Jörg; Ordan, Noam; Teich, Elke

A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus Inproceedings

Corpus Linguistics 2015, Lancaster, 2015.
There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.

@inproceedings{Khamis-etal2015,
title = {A resource for the diachronic study of scientific English: Introducing the Royal Society Corpus},
author = {Ashraf Khamis and Stefania Degaetano-Ortlieb and Hannah Kermes and J{\"o}rg Knappen and Noam Ordan and Elke Teich},
url = {https://www.researchgate.net/publication/331648570_A_resource_for_the_diachronic_study_of_scientific_English_Introducing_the_Royal_Society_Corpus},
year = {2015},
date = {2015-07-01},
booktitle = {Corpus Linguistics 2015},
address = {Lancaster},
abstract = {

There is a wealth of corpus resources for the study of contemporary scientific English, ranging from written vs. spoken mode to expert vs. learner productions as well as different genres, registers and domains (e.g. MICASE (Simpson et al. 2002), BAWE (Nesi 2011) and SciTex (Degaetano-Ortlieb et al. 2013)). The multi-genre corpora of English (notably BNC and COCA) include fair amounts of scientific text too.
},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Kermes, Hannah; Khamis, Ashraf; Knappen, Jörg; Teich, Elke

Information Density in Scientific Writing: A Diachronic Perspective Inproceedings

"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015), RWTH Aachen University, 2015.
We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).
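
The abstract's working definition of information density, the probability of an item conditioned by its context (its surprisal), can be made concrete with a small sketch. The following Python fragment is an assumed illustration only, not the project's pipeline: it estimates per-word surprisal from a toy bigram model with add-one smoothing and averages it over a text; the mini-corpus, the model and the smoothing choice are all placeholders.

import math
from collections import Counter

# Toy corpus; in the project this would be text from the Royal Society Corpus.
tokens = ("the royal society published the philosophical transactions "
          "of the royal society").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)

def surprisal(prev, word):
    # Surprisal in bits: -log2 P(word | previous word), add-one smoothed.
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(p)

# Averaging surprisal over a text gives one simple estimate of its information density.
pairs = list(zip(tokens, tokens[1:]))
print(sum(surprisal(p, w) for p, w in pairs) / len(pairs))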

@inproceedings{Degaetano-etal2015b,
title = {Information Density in Scientific Writing: A Diachronic Perspective},
author = {Stefania Degaetano-Ortlieb and Hannah Kermes and Ashraf Khamis and J{\"o}rg Knappen and Elke Teich},
url = {https://www.researchgate.net/publication/331648534_Information_Density_in_Scientific_Writing_A_Diachronic_Perspective},
year = {2015},
date = {2015-07-01},
booktitle = {"Challenging Boundaries" - 42nd International Systemic Functional Congress (ISFCW2015)},
address = {RWTH Aachen University},
abstract = {

We report on a project investigating the development of scientific writing in English from the mid-17th century to present. While scientific discourse is a much researched topic, including its historical development (see e.g. Banks (2008) in the context of Systemic Functional Grammar), it has so far not been modeled from the perspective of information density. Our starting assumption is that as science develops to be an established socio-cultural domain, it becomes more specialized and conventionalized. Thus, denser linguistic encodings are required for communication to be functional, potentially increasing the information density of scientific texts (cf. Halliday and Martin, 1993:54-68). More specifically, we pursue the following hypotheses: (1) As a reflex of specialization, scientific texts will exhibit a greater encoding density over time, i.e. denser linguistic forms will be increasingly used. (2) As a reflex of conventionalization, scientific texts will exhibit greater linguistic uniformity over time, i.e. the linguistic forms used will be less varied. We further assume that the effects of specialization and conventionalization in the linguistic signal are measurable independently in terms of information density (see below). We have built a diachronic corpus of scientific texts from the Transactions and Proceedings of the Royal Society of London. We have chosen these materials due to the prominent role of the Royal Society in forming English scientific discourse (cf. Atkinson, 1998). At the time of writing, the corpus comprises 23 million tokens for the period of 1665-1870 and has been normalized, tokenized and part-of-speech tagged. For analysis, we combine methods from register theory (Halliday and Hasan, 1985) and computational language modeling (Manning et al., 2009: 237-240). The former provides us with features that are potentially register-forming (cf. also Ure, 1971; 1982); the latter provides us with models with which we can measure information density. For analysis, we pursue two complementary methodological approaches: (a) Pattern-based extraction and quantification of linguistic constructions that are potentially involved in manipulating information density. Here, basically all linguistic levels are relevant (cf. Harris, 1991), from lexis and grammar to cohesion and generic structure. We have started with the level of lexico-grammar, inspecting for instance morphological compression (derivational processes such as conversion, compounding etc.) and syntactic reduction (e.g. reduced vs full relative clauses). (b) Measuring information density using information-theoretic models (cf. Shannon, 1949). In current practice, information density is measured as the probability of an item conditioned by context. For our purposes, we need to compare such probability distributions to assess the relative information density of texts along a time line. In the talk, we introduce our corpus (metadata, preprocessing, linguistic annotation) and present selected analyses of relative information density and associated linguistic variation in the given time period (1665-1870).
},
pubstate = {published},
type = {inproceedings}
}

Project:   B1

Greenberg, Clayton; Demberg, Vera; Sayeed, Asad

Verb Polysemy and Frequency Effects in Thematic Fit Modeling Inproceedings

Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics, pp. 48-57, Denver, Colorado, 2015.

While several data sets for evaluating thematic fit of verb-role-filler triples exist, they do not control for verb polysemy. Thus, it is unclear how verb polysemy affects human ratings of thematic fit and how best to model that. We present a new dataset of human ratings on high vs. low-polysemy verbs matched for verb frequency, together with high vs. low-frequency and well-fitting vs. poorly-fitting patient role-fillers. Our analyses show that low-polysemy verbs produce stronger thematic fit judgements than verbs with higher polysemy. Role-filler frequency, on the other hand, had little effect on ratings. We show that these results can best be modeled in a vector space using a clustering technique to create multiple prototype vectors representing different “senses” of the verb.
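
The clustering technique mentioned at the end of the abstract can be sketched as follows. This Python fragment is an assumed illustration rather than the paper's implementation: the vectors of a verb's attested patient fillers are clustered into several "sense" centroids, and a candidate filler is scored by its cosine similarity to the nearest centroid. The random vectors, the number of clusters and the use of scikit-learn's KMeans are placeholders for the real distributional model.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
filler_vectors = rng.normal(size=(50, 100))   # vectors of a verb's attested patient fillers
candidate = rng.normal(size=100)              # vector of a candidate patient filler

n_senses = 3                                  # assumed number of verb "senses"
kmeans = KMeans(n_clusters=n_senses, n_init=10, random_state=0).fit(filler_vectors)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Score against the closest sense prototype instead of a single averaged prototype,
# which would blur distinct senses of a polysemous verb together.
fit = max(cosine(candidate, centroid) for centroid in kmeans.cluster_centers_)
print(fit)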

@inproceedings{greenberg-demberg-sayeed:2015:CMCL,
title = {Verb Polysemy and Frequency Effects in Thematic Fit Modeling},
author = {Clayton Greenberg and Vera Demberg and Asad Sayeed},
url = {http://www.aclweb.org/anthology/W15-1106},
year = {2015},
date = {2015-06-01},
booktitle = {Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics},
pages = {48-57},
publisher = {Association for Computational Linguistics},
address = {Denver, Colorado},
abstract = {While several data sets for evaluating thematic fit of verb-role-filler triples exist, they do not control for verb polysemy. Thus, it is unclear how verb polysemy affects human ratings of thematic fit and how best to model that. We present a new dataset of human ratings on high vs. low-polysemy verbs matched for verb frequency, together with high vs. low-frequency and well-fitting vs. poorly-fitting patient role-fillers. Our analyses show that low-polysemy verbs produce stronger thematic fit judgements than verbs with higher polysemy. Role-filler frequency, on the other hand, had little effect on ratings. We show that these results can best be modeled in a vector space using a clustering technique to create multiple prototype vectors representing different “senses” of the verb.},
pubstate = {published},
type = {inproceedings}
}

Projects:   B2 B4

Crocker, Matthew W.; Demberg, Vera; Teich, Elke

Information Density and Linguistic Encoding (IDeaL) Journal Article

KI - Künstliche Intelligenz, 30, pp. 77-81, 2015.

We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.

@article{crocker:demberg:teich,
title = {Information Density and Linguistic Encoding (IDeaL)},
author = {Matthew W. Crocker and Vera Demberg and Elke Teich},
url = {http://link.springer.com/article/10.1007/s13218-015-0391-y/fulltext.html},
doi = {https://doi.org/10.1007/s13218-015-0391-y},
year = {2015},
date = {2015},
journal = {KI - K{\"u}nstliche Intelligenz},
pages = {77-81},
volume = {30},
number = {1},
abstract = {

We introduce IDeaL (Information Density and Linguistic Encoding), a collaborative research center that investigates the hypothesis that language use may be driven by the optimal use of the communication channel. From the point of view of linguistics, our approach promises to shed light on selected aspects of language variation that are hitherto not sufficiently explained. Applications of our research can be envisaged in various areas of natural language processing and AI, including machine translation, text generation, speech synthesis and multimodal interfaces.
},
pubstate = {published},
type = {article}
}

Projects:   A1 A3 B1

Kampmann, Alexander; Thater, Stefan; Pinkal, Manfred

A Case-Study of Automatic Participant Labeling Inproceedings

Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2015), 2015.

Knowledge about stereotypical activities like visiting a restaurant or checking in at the airport is an important component to model text understanding. We report on a case study of automatically relating texts to scripts representing such stereotypical knowledge. We focus on the subtask of mapping noun phrases in a text to participants in the script. We analyse the effect of various similarity measures and show that substantial positive results can be achieved on this complex task, indicating that the general problem is solvable in principle.
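
As a rough, assumed illustration of similarity-based participant mapping (not the system evaluated in the paper), the sketch below assigns each noun phrase to the script participant whose label set is most similar under cosine similarity over toy bag-of-words vectors; the restaurant participants and phrases are invented examples, and real similarity measures would operate on richer representations.

import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Participants of a "restaurant visit" script, each described by a few label words.
participants = {"customer": bow("customer guest diner"),
                "waiter": bow("waiter server staff"),
                "food": bow("food meal dish dinner")}

# Map each noun phrase from a text to its most similar participant.
for phrase in ["the hungry guest", "a friendly server", "a warm meal"]:
    best = max(participants, key=lambda p: cosine(bow(phrase), participants[p]))
    print(phrase, "->", best)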

@inproceedings{kampmann2015case,
title = {A Case-Study of Automatic Participant Labeling},
author = {Alexander Kampmann and Stefan Thater and Manfred Pinkal},
url = {https://www.bibsonomy.org/bibtex/256c2839962cccb21f7a2d41b3a83267?postOwner=sfb1102&intraHash=132779a64f2563005c65ee9cc14beb5f},
year = {2015},
date = {2015},
booktitle = {Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2015)},
abstract = {Knowledge about stereotypical activities like visiting a restaurant or checking in at the airport is an important component to model text understanding. We report on a case study of automatically relating texts to scripts representing such stereotypical knowledge. We focus on the subtask of mapping noun phrases in a text to participants in the script. We analyse the effect of various similarity measures and show that substantial positive results can be achieved on this complex task, indicating that the general problem is solvable in principle.},
pubstate = {published},
type = {inproceedings}
}

Project:   A2

Rohrbach, Marcus; Rohrbach, Anna; Regneri, Michaela; Amin, Sikandar; Andriluka, Mykhaylo; Pinkal, Manfred; Schiele, Bernt

Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data Journal Article

International Journal of Computer Vision, pp. 1-28, 2015.

Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body motions. We explore how human pose and hands can help to approach this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition we find that decomposition into attributes allows sharing information across composites and is essential to attack this hard task. Using script data we can recognize novel composites without having training data for them.
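
The attribute-based decomposition described in the abstract can be sketched in a few lines. This is an assumed illustration, not the paper's model: a test video's scores for fine-grained attributes are matched against composite-to-attribute associations, and for unseen composites those associations could be read off script or text data instead of training videos. All names, scores and associations below are made up.

import numpy as np

attributes = ["cut", "stir", "pour", "peel"]

# Attribute classifier scores for one test video (stand-ins for real classifier outputs).
attribute_scores = np.array([0.9, 0.1, 0.7, 0.8])

# Composite -> attribute associations; rows could be estimated from training videos
# or, for novel composites, derived from textual scripts (the zero-shot case).
composites = {
    "prepare salad":      np.array([1, 0, 0, 1]),
    "make scrambled egg": np.array([0, 1, 1, 0]),
}

# Score each composite by how well its expected attributes match the observed ones.
for name, assoc in composites.items():
    print(name, float(attribute_scores @ assoc) / assoc.sum())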

@article{rohrbach2015recognizing,
title = {Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data},
author = {Marcus Rohrbach and Anna Rohrbach and Michaela Regneri and Sikandar Amin and Mykhaylo Andriluka and Manfred Pinkal and Bernt Schiele},
url = {https://link.springer.com/article/10.1007/s11263-015-0851-8},
year = {2015},
date = {2015},
journal = {International Journal of Computer Vision},
pages = {1-28},
abstract = {

Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body motions. We explore how human pose and hands can help to approach this challenge by comparing two pose-based and two hand-centric features with state-of-the-art holistic features. To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts. We show the benefits of our hand-centric approach for fine-grained activity classification and detection. For composite activity recognition we find that decomposition into attributes allows sharing information across composites and is essential to attack this hard task. Using script data we can recognize novel composites without having training data for them.
},
pubstate = {published},
type = {article}
}

Project:   A2

Klakow, Dietrich; Avgustinova, Tania; Stenger, Irina; Fischer, Andrea; Jágrová, Klára

The INCOMSLAV Project Inproceedings

Seminar in formal linguistics at ÚFAL, Charles University, Prague, 2014.

The human language processing mechanism shows a remarkable robustness to different kinds of imperfect linguistic signal. The INCOMSLAV project aims at gaining insights into human retrieval of information in the mode of intercomprehension, i.e. from texts in genetically related languages not acquired through language learning. Furthermore, it adds to this synchronic approach a diachronic perspective which provides the vital common denominator in establishing the extent of linguistic proximity. The languages to be analysed are chosen from the group of Slavic languages (CZ, PL, RU, BG). Whereas the possibility of intercomprehension between related languages is a generally accepted fact and the ways it functions have been studied for certain language groups, such analyses have not yet been undertaken from a systematic point of view focusing on information en- and decoding at different linguistic levels. The research programme will bring together results from the analysis of parallel corpora and from a variety of experiments with native speakers of Slavic languages and will compare them with insights of comparative historical linguistics on the relationship between Slavic languages. The results should add a cross-linguistic perspective to the question of how language users master high degrees of surprisal (due to partial incomprehensibility) and extract information from “noisy” code.
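
One common way to operationalize the "extent of linguistic proximity" mentioned above is normalized Levenshtein (edit) distance between cognate word pairs. The sketch below is an assumed illustration, not the project's actual methodology, and the Czech-Polish word pairs serve only as examples.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Illustrative Czech-Polish cognate pairs; lower normalized distance suggests
# higher orthographic proximity and, plausibly, easier intercomprehension.
for cz, pl in [("mléko", "mleko"), ("pes", "pies"), ("kniha", "książka")]:
    print(cz, pl, levenshtein(cz, pl) / max(len(cz), len(pl)))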

@inproceedings{dietrich2014incomslav,
title = {The INCOMSLAV Project},
author = {Dietrich Klakow and Tania Avgustinova and Irina Stenger and Andrea Fischer and Kl{\'a}ra J{\'a}grov{\'a}},
url = {https://ufal.mff.cuni.cz/events/incomslav-project},
year = {2014},
date = {2014},
booktitle = {Seminar in formal linguistics at ÚFAL},
publisher = {Charles University},
address = {Prague},
abstract = {The human language processing mechanism shows a remarkable robustness to different kinds of imperfect linguistic signal. The INCOMSLAV project aims at gaining insights into human retrieval of information in the mode of intercomprehension, i.e. from texts in genetically related languages not acquired through language learning. Furthermore, it adds to this synchronic approach a diachronic perspective which provides the vital common denominator in establishing the extent of linguistic proximity. The languages to be analysed are chosen from the group of Slavic languages (CZ, PL, RU, BG). Whereas the possibility of intercomprehension between related languages is a generally accepted fact and the ways it functions have been studied for certain language groups, such analyses have not yet been undertaken from a systematic point of view focusing on information en- and decoding at different linguistic levels. The research programme will bring together results from the analysis of parallel corpora and from a variety of experiments with native speakers of Slavic languages and will compare them with insights of comparative historical linguistics on the relationship between Slavic languages. The results should add a cross-linguistic perspective to the question of how language users master high degrees of surprisal (due to partial incomprehensibility) and extract information from “noisy” code.},
pubstate = {published},
type = {inproceedings}
}

Project:   C4
