This dataset contains a large variety of human mishearings of synonyms in noise, presented both with linguistic context (91 pairs) and in isolation (189 pairs). The annotations were collected from 126 native British English speakers with normal hearing thresholds.
Contact person: Anupama Chingacham
Link to the resource
C6C is a Python pipeline for converting historical corpora from various input formats into common, standardized output formats. It also allows for custom processing of the input data, e.g., mapping historical POS tagsets to the modern standard tagset.
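As a rough illustration of the kind of custom processing step mentioned above, here is a hypothetical sketch of a tagset-mapping function; the tag pairs and the function name are placeholders, not C6C's actual API or mapping:

```python
# Hypothetical sketch of a tagset-mapping step; the entries below are
# placeholders and do not reflect C6C's actual API or any real mapping.
HISTORICAL_TO_MODERN = {
    "VVFIN": "VVFIN",  # finite full verb: identical in both tagsets
    "DDART": "ART",    # historical article tag mapped to the modern STTS tag
}

def map_pos_tags(tokens):
    """Replace each token's historical POS tag with its modern counterpart."""
    for tok in tokens:
        tok["pos"] = HISTORICAL_TO_MODERN.get(tok["pos"], tok["pos"])
    return tokens
```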
Contact person: Stefanie Dipper
Link to the resource
The corpus contains manual annotations of sentence brackets (Wöllstein 2014); extraposed and adjacent relative clauses with their antecedent, position, and type of restrictiveness; as well as extraposed prepositional phrases, nominal phrases, adjective phrases, adverbial phrases, and comparative elements with an embedded counterpart to allow comparison.
Contact person: Stefanie Dipper
Link to the resource
DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. It covers 40 scenarios with approximately 100 ESDs each. The corpus also contains partial alignments of event descriptions that are semantically similar with respect to the given scenario.
Contact person: Stefan Thater
Link to the resource
This resource contains sentence-level annotations, with sentences (segments) labeled according to the scripts they instantiate. Each text was independently annotated by two annotators, who identified segments referring to a scenario from a given scenario list and assigned scenario labels; segments referring to more than one script could receive multiple labels. A scenario label is either one of 200 scenarios or “None”, used for sentences that do not refer to any of the scenarios. The resource contains 504 documents, consisting of a total of 10,754 sentences. On average, each document is 35.74 sentences long.
Contact person: Michael Roth
Link to the resource
A software package, written in Prolog, that implements the Distributional Formal Semantics framework.
Reference: Venhuizen, N. J., Hendriks, P., Crocker, M. W., & Brouwer, H. (2019). A Framework for Distributional Formal Semantics. In: Iemhoff, R., Moortgat, M., & de Queiroz, R. (Eds.), Logic, Language, Information, and Computation, Proceedings of the 26th International Workshop WoLLIC 2019, LNCS 11541, pp. 633-646. doi: 10.1007/978-3-662-59533-6_39
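As a rough illustration of the underlying idea (the package itself is implemented in Prolog), propositions in DFS can be viewed as vectors of truth values over a set of models, with probabilistic inference reduced to vector operations. The sketch below uses invented propositions and probabilities:

```python
import numpy as np

# Toy sketch only: propositions as truth-value vectors over sampled models.
# The propositions and probabilities below are invented for illustration.
n_models = 10_000
rng = np.random.default_rng(0)

rain = rng.random(n_models) < 0.3          # "rain" holds in ~30% of models
wet = rain | (rng.random(n_models) < 0.1)  # "wet" holds whenever "rain" does

def prob(a):
    """P(a): fraction of models in which proposition a is true."""
    return a.mean()

def cond_prob(a, b):
    """P(a | b): fraction of b-models in which a is also true."""
    return (a & b).mean() / b.mean()

print(prob(wet))             # prior probability of "wet"
print(cond_prob(wet, rain))  # 1.0, since "rain" entails "wet" by construction
```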
Contact person: Noortje Venhuizen
Link to the resource
The resource contains alignments between the discourse annotations of the Penn Discourse Treebank (PDTB) and the RST Discourse Treebank (RST-DT) for the overlapping part of their texts.
Contact person: Merel Scholman
Link to the resource
The resource contains all texts from the Broadcast interview and Telephone conversation genres of the SPICE-Ireland corpus, annotated with discourse relations according to the PDTB 3.0 and CCR frameworks.
Contact person: Vera Demberg
Link to the resource
EPIC-UdS is a trilingual parallel and comparable corpus of speeches delivered in the European Parliament by MEPs, following the European Parliament Interpreting Corpora tradition of EPIC (Bologna) and EPICG (Ghent). It contains original speeches from 2008 to 2013 by English, German, and Spanish native speakers and their interpretations (EN to/from DE, ES to EN).
Contact person: Heike Przybyl
Link to the resource
The EuroParl-UdS corpus is a parallel corpus of parliamentary debates of the European Parliament, filtered to include only texts by native speakers. It is currently available for English, German, and Spanish in plain-text format and covers texts produced between 1999 and 2017. More specifically, it consists of parallel (sentence-aligned) corpora for English into German and English into Spanish, where the source side contains only texts by native English speakers, and comparable monolingual corpora for English, German, and Spanish, containing only texts by native speakers of each language.
Contact person: Elke Teich
Link to the resource
A corpus for training natural language generation systems in the restaurant domain. The corpus was collected with two different target audiences in mind: adults and older adults.
Contact person: David Howcroft
Link to the resource
A corpus of German referring expressions for furniture objects in different colours, sizes, and orientations. Comparable datasets exist for the same set of images in other languages.
Contact person: David Howcroft
Link to the resource
The InScript corpus contains a total of 1,000 narrative texts crowdsourced via Amazon Mechanical Turk. The texts cover 10 different scenarios describing everyday situations such as taking a bath or baking a cake. They are annotated with script information in the form of scenario-specific event and participant labels, as well as with coreference chains linking different mentions of the same entity within a document.
Contact person: Stefan Thater
Link to the resource
MC-Saar-Instruct is a platform for running crowdsourcing experiments with Minecraft. It acts as a Minecraft server plugin and is tailored towards making it easy to build systems that generate building instructions. Participants can log into the server with a standard Minecraft client.
Contact person: Alexander Koller
Link to the resource
MCScript is a dataset for the task of machine comprehension focussing on commonsense knowledge. Questions were collected based on script scenarios rather than individual texts, which resulted in question-answer pairs that explicitly involve commonsense knowledge. The dataset comprises 13,939 questions on 2,119 narrative texts, each annotated with one of 110 different everyday scenarios. Questions are typed via crowdsourced annotation, indicating whether they can be answered from the text or whether commonsense knowledge is needed to find an answer.
Contact person: Michael Roth
Link to the resource
MCScript 2.0 is a machine comprehension corpus for the end-to-end evaluation of script knowledge. It contains approx. 20,000 questions on approx. 3,500 texts, crowdsourced via a new collection process that yields challenging questions. Half of the questions cannot be answered from the reading texts alone but require the use of commonsense and, in particular, script knowledge. The task is not challenging for humans, but existing machine comprehension models fail to perform well on the data, even when they make use of a commonsense knowledge base. Note: the download contains only the training and development data; the test data are not public as of May 2019, since the dataset is used for a shared task at the COIN workshop (https://coinnlp.github.io/task1.html).
Contact person: Michael Roth
Link to the resource
This resource contains human discourse relation (DR) predictions on the InScript corpus, collected via Amazon Mechanical Turk. For details, please refer to the accompanying paper.
Contact person: Frances Yung
Link to the resource
Multilingual Parallel Direct Europarl (MPDE) is a clean subset of the Europarl corpus that distinguishes translations done without an intermediate language from those that might have used a pivot language. MPDE includes two sets, a comparable and a parallel corpus, with data aligned across multiple languages, as well as scripts for adapting the extraction from Europarl to further languages.
Contact person: Kwabena Amponsah-Kaakyire
Link to the resource
Naija-Lex is a lexicon of Nigerian Pidgin discourse connectives with English translations and the associated discourse relation senses. The resource was constructed by combining automatic discourse parsing with manual annotation and enables further research into discourse structure and coherence marking in Pidgin.
Reference: Marchal, M., Scholman, M. C. J., & Demberg, V. (2021). Semi-automatic discourse annotation in a low-resource language: Developing a connective lexicon for Nigerian Pidgin. Proceedings of the Second Workshop on Computational Approaches to Discourse (CODI).
Contact person: Merel Scholman
Link to the resource
This repository provides ratings of name agreement and semantic categorization for 247 colored clipart pictures, collected from 40 native German-speaking children aged 4 to 6 years.
Reference: Sommerfeld, L., Staudte, M., & Kray, J. (2021). Ratings of Name Agreement and Semantic Categorization of 247 Colored Clipart Pictures by Young German Children [Manuscript under revision]. Department of Psychology, Saarland University, Germany.
Contact person: Linda Sommerfeld
Link to the resource
The resource is a set of language models at the syllable and phone level. The context length (n-gram length) of the models ranges from 1 to 4 at the syllable level and from 1 to 6 at the phone level. For each n-gram length, two model versions are provided: a forward version, which contains the probability of a unit given its preceding context, and a backward version, which contains the probability of a unit given its following context. The models were trained on the DeWaC German web corpus (Baroni and Kilgarriff 2006) using the SRILM language modeling toolkit (Stolcke 2002).
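To illustrate the forward/backward distinction, here is a toy sketch using bigram counts over a made-up syllable sequence; the released models themselves are SRILM-trained n-gram models, not this code:

```python
from collections import Counter

# Toy illustration of the forward/backward distinction; the syllable
# sequence below is made up for demonstration purposes.
syllables = ["BE", "ZAH", "LEN", "BE", "GIN", "NEN", "BE", "ZAH", "LEN"]

bigrams = Counter(zip(syllables, syllables[1:]))
unigrams = Counter(syllables)

def p_forward(unit, prev):
    """P(unit | preceding context): predictability from what came before."""
    return bigrams[(prev, unit)] / unigrams[prev]

def p_backward(unit, nxt):
    """P(unit | following context): predictability from what comes after."""
    return bigrams[(unit, nxt)] / unigrams[nxt]

print(p_forward("ZAH", prev="BE"))   # 2/3: "BE" is followed by "ZAH" in 2 of 3 cases
print(p_backward("BE", nxt="ZAH"))   # 1.0: "ZAH" is always preceded by "BE"
```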
Link to the resource
The Royal Society Corpus (RSC) is based on the first two centuries of the Philosophical Transactions of the Royal Society of London, from its beginning in 1665 to 1869. It includes all publications of the journal that are written mainly in English and contain running text. The Philosophical Transactions was the first periodical of scientific writing in England. RSC Version 4 consists of approximately 32 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analyses at different granularities. We also annotate the two most important topics of each text according to a topic model consisting of 24 topics; the full topic model is also available for download. The corpus is tokenized and linguistically annotated for lemma and part of speech using TreeTagger (Schmid 1994, Schmid 1995). For spelling normalization we use a trained model of VARD (Baron and Rayson 2008). As a special feature, we encode with each unit (word token) its average surprisal, i.e. the average amount of information it carries in bits, with words as units and trigrams as contexts (cf. Genzel and Charniak 2002). Release 4.0 of the corpus includes improved OCR correction and the removal of non-text tokens such as formulæ and tables.
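A minimal sketch of how such per-token and average surprisal values can be computed from a trigram model; `trigram_prob` is a hypothetical stand-in for the actual language model used to build the annotation:

```python
import math
from collections import defaultdict

def token_surprisal(tokens, trigram_prob):
    """Surprisal of each token in bits: s(w_i) = -log2 P(w_i | w_{i-2}, w_{i-1})."""
    padded = ["<s>", "<s>"] + tokens
    return [(padded[i], -math.log2(trigram_prob(padded[i - 2], padded[i - 1], padded[i])))
            for i in range(2, len(padded))]

def average_surprisal(tokens, trigram_prob):
    """Average surprisal per word (type), averaged over all its occurrences."""
    sums, counts = defaultdict(float), defaultdict(int)
    for word, s in token_surprisal(tokens, trigram_prob):
        sums[word] += s
        counts[word] += 1
    return {w: sums[w] / counts[w] for w in sums}
```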
Contact person: Elke Teich
Link to the resource
The RSC 6.0 Open is an extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering more than 300 years of scientific writing (1665-1996). It consists of approximately 78.6 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analyses at different granularities.
Contact person: Elke Teich
Link to the resource
A mixed-register corpus for the investigation of fragments, i.e. incomplete sentences, in German.
Contact person: Lisa Schäfer
Link to the resource