This dataset contains a large variety of human mishearings of synonyms in noise, presented both with linguistic context (91 pairs) and in isolation (189 pairs). The annotations were collected from 126 native British English speakers with normal hearing thresholds.
Contact person: Anupama Chingacham
Link to the resource
C6C is a Python pipeline for converting historical corpora from various input formats into common, standardized output formats. It also allows for custom processing of the input data, e.g., mapping historical POS tagsets to the modern standard tagset.
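As a rough illustration of the kind of custom processing step mentioned above, here is a hypothetical sketch of a tagset-mapping function; the tag pairs and the function name are placeholders, not C6C's actual API or mapping:

```python
# Hypothetical sketch of a tagset-mapping step; the entries below are
# placeholders and do not reflect C6C's actual API or any real mapping.
HISTORICAL_TO_MODERN = {
    "VVFIN": "VVFIN",  # finite full verb: identical in both tagsets
    "DDART": "ART",    # historical article tag mapped to the modern STTS tag
}

def map_pos_tags(tokens):
    """Replace each token's historical POS tag with its modern counterpart."""
    for tok in tokens:
        tok["pos"] = HISTORICAL_TO_MODERN.get(tok["pos"], tok["pos"])
    return tokens
```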
Contact person: Stefanie Dipper
Link to the resource
The corpus contains manual annotations of sentence brackets (Wöllstein 2014); extraposed and adjacent relative clauses with their antecedent, position, and type of restrictiveness; as well as extraposed prepositional phrases, nominal phrases, adjective phrases, adverbial phrases, and comparative elements with an embedded counterpart to allow comparison.
Contact person: Stefanie Dipper
Link to the resource
DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. It covers 40 scenarios with approximately 100 ESDs each. The corpus also contains partial alignments of event descriptions that are semantically similar with respect to the given scenario.
Contact person: Stefan Thater
Link to the resource
This resource contains sentence-level annotations, with sentences (segments) labeled according to the scripts they instantiate. Each text was independently annotated by two annotators, who identified segments referring to a scenario from a given scenario list and assigned scenario labels; segments referring to more than one script could receive multiple labels. A scenario label is either one of 200 scenarios or “None”, used for sentences that do not refer to any of the scenarios. The resource contains 504 documents, consisting of a total of 10,754 sentences. On average, each document is 35.74 sentences long.
Contact person: Michael Roth
Link to the resource
A software package, written in Prolog, that implements the Distributional Formal Semantics framework.
Reference: Venhuizen, N. J., Hendriks, P., Crocker, M. W., & Brouwer, H. (2019). A Framework for Distributional Formal Semantics. In: Iemhoff, R., Moortgat, M., & de Queiroz, R. (Eds.), Logic, Language, Information, and Computation, Proceedings of the 26th International Workshop WoLLIC 2019, LNCS 11541, pp. 633-646. doi: 10.1007/978-3-662-59533-6_39
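As a rough illustration of the underlying idea (the package itself is implemented in Prolog), propositions in DFS can be viewed as vectors of truth values over a set of models, with probabilistic inference reduced to vector operations. The sketch below uses invented propositions and probabilities:

```python
import numpy as np

# Toy sketch only: propositions as truth-value vectors over sampled models.
# The propositions and probabilities below are invented for illustration.
n_models = 10_000
rng = np.random.default_rng(0)

rain = rng.random(n_models) < 0.3          # "rain" holds in ~30% of models
wet = rain | (rng.random(n_models) < 0.1)  # "wet" holds whenever "rain" does

def prob(a):
    """P(a): fraction of models in which proposition a is true."""
    return a.mean()

def cond_prob(a, b):
    """P(a | b): fraction of b-models in which a is also true."""
    return (a & b).mean() / b.mean()

print(prob(wet))             # prior probability of "wet"
print(cond_prob(wet, rain))  # 1.0, since "rain" entails "wet" by construction
```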
Contact person: Noortje Venhuizen
Link to the resource
The resource contains alignments between the discourse annotations of the Penn Discourse Treebank (PDTB) and the RST Discourse Treebank (RST-DT) for the overlapping part of their texts.
Contact person: Merel Scholman
Link to the resource
The resource contains all texts from the Broadcast interview and Telephone conversation genres of the SPICE-Ireland corpus, annotated with discourse relations according to the PDTB 3.0 and CCR frameworks.
Contact person: Vera Demberg
Link to the resource
EPIC-UdS is a trilingual parallel and comparable corpus of speeches delivered in the European Parliament by MEPs, following the European Parliament Interpreting Corpora tradition of EPIC (Bologna) and EPICG (Ghent). It contains original speeches from 2008 to 2013 by English, German, and Spanish native speakers and their interpretations (EN to/from DE, ES to EN).
Contact person: Heike Przybyl
Link to the resource
The EuroParl-UdS corpus is a parallel corpus of parliamentary debates of the European Parliament, filtered to include only texts by native speakers. It is currently available for English, German, and Spanish in plain-text format and covers texts produced between 1999 and 2017. More specifically, it consists of parallel (sentence-aligned) corpora for English into German and English into Spanish, where the source side contains only texts by native English speakers, and comparable monolingual corpora for English, German, and Spanish, containing only texts by native speakers of each language.
Contact person: Elke Teich
Link to the resource
A corpus for training natural language generation systems in the restaurant domain. The corpus was collected with two different target audiences in mind: adults and older adults.
Contact person: David Howcroft
Link to the resource
A corpus of German referring expressions for furniture objects in different colours, sizes, and orientations. Comparable datasets exist for the same set of images in other languages.
Contact person: David Howcroft
Link to the resource
The InScript corpus contains a total of 1,000 narrative texts crowdsourced via Amazon Mechanical Turk. The texts cover 10 different scenarios describing everyday situations such as taking a bath or baking a cake. They are annotated with script information in the form of scenario-specific event and participant labels, as well as with coreference chains linking different mentions of the same entity within a document.
Contact person: Stefan Thater
Link to the resource
MC-Saar-Instruct is a platform for running crowdsourcing experiments with Minecraft. It acts as a Minecraft server plugin and is tailored towards making it easy to build systems that generate building instructions. Participants can log into the server with a standard Minecraft client.
Contact person: Alexander Koller
Link to the resource
MCScript is a dataset for the task of machine comprehension focussing on commonsense knowledge. Questions were collected based on script scenarios rather than individual texts, which resulted in question-answer pairs that explicitly involve commonsense knowledge. The dataset comprises 13,939 questions on 2,119 narrative texts, each annotated with one of 110 different everyday scenarios. Questions are typed via crowdsourced annotation, indicating whether they can be answered from the text or whether commonsense knowledge is needed to find an answer.
Contact person: Michael Roth
Link to the resource
MCScript 2.0 is a machine comprehension corpus for the end-to-end evaluation of script knowledge. It contains approx. 20,000 questions on approx. 3,500 texts, crowdsourced via a new collection process that yields challenging questions. Half of the questions cannot be answered from the reading texts alone but require the use of commonsense and, in particular, script knowledge. The task is not challenging for humans, but existing machine comprehension models fail to perform well on the data, even when they make use of a commonsense knowledge base. Note: the download contains only the training and development data; the test data are not public as of May 2019, since the dataset is used for a shared task at the COIN workshop (https://coinnlp.github.io/task1.html).
Contact person: Michael Roth
Link to the resource
This resource contains human discourse relation (DR) predictions on the InScript corpus, collected via Amazon Mechanical Turk. For details, please refer to the accompanying paper.
Contact person: Frances Yung
Link to the resource
Multilingual Parallel Direct Europarl (MPDE) is a clean subset of the Europarl corpus that distinguishes translations done without an intermediate language from those that might have used a pivot language. MPDE includes two sets, a comparable and a parallel corpus, with data aligned across multiple languages, as well as scripts for adapting the extraction from Europarl to further languages.
Contact person: Kwabena Amponsah-Kaakyire
Link to the resource
Naija-Lex is a lexicon of Nigerian Pidgin discourse connectives with English translations and the associated discourse relation senses. The resource was constructed by combining automatic discourse parsing with manual annotation and enables further research into discourse structure and coherence marking in Pidgin.
Reference: Marchal, M., Scholman, M. C. J., & Demberg, V. (2021). Semi-automatic discourse annotation in a low-resource language: Developing a connective lexicon for Nigerian Pidgin. Proceedings of the Second Workshop on Computational Approaches to Discourse (CODI).
Contact person: Merel Scholman
Link to the resource
This repository provides ratings of name agreement and semantic categorization for 247 colored clipart pictures, collected from 40 native German-speaking children aged 4 to 6 years.
Reference: Sommerfeld, L., Staudte, M., & Kray, J. (2021). Ratings of Name Agreement and Semantic Categorization of 247 Colored Clipart Pictures by Young German Children [Manuscript under revision]. Department of Psychology, Saarland University, Germany.
Contact person: Linda Sommerfeld
Link to the resource
The resource is a set of language models at the syllable and phone level. The context length (n-gram length) of the models ranges from 1 to 4 at the syllable level and from 1 to 6 at the phone level. For each n-gram length, two model versions are provided: a forward version, which contains the probability of a unit given its preceding context, and a backward version, which contains the probability of a unit given its following context. The models were trained on the DeWaC German web corpus (Baroni and Kilgarriff 2006) using the SRILM language modeling toolkit (Stolcke 2002).
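To illustrate the forward/backward distinction, here is a toy sketch using bigram counts over a made-up syllable sequence; the released models themselves are SRILM-trained n-gram models, not this code:

```python
from collections import Counter

# Toy illustration of the forward/backward distinction; the syllable
# sequence below is made up for demonstration purposes.
syllables = ["BE", "ZAH", "LEN", "BE", "GIN", "NEN", "BE", "ZAH", "LEN"]

bigrams = Counter(zip(syllables, syllables[1:]))
unigrams = Counter(syllables)

def p_forward(unit, prev):
    """P(unit | preceding context): predictability from what came before."""
    return bigrams[(prev, unit)] / unigrams[prev]

def p_backward(unit, nxt):
    """P(unit | following context): predictability from what comes after."""
    return bigrams[(unit, nxt)] / unigrams[nxt]

print(p_forward("ZAH", prev="BE"))   # 2/3: "BE" is followed by "ZAH" in 2 of 3 cases
print(p_backward("BE", nxt="ZAH"))   # 1.0: "ZAH" is always preceded by "BE"
```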
Link to the resource
The Royal Society Corpus (RSC) is based on the first two centuries of the Philosophical Transactions of the Royal Society of London, from its beginning in 1665 to 1869. It includes all publications of the journal that are written mainly in English and contain running text. The Philosophical Transactions was the first periodical of scientific writing in England. RSC Version 4 consists of approximately 32 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analyses at different granularities. We also annotate the two most important topics of each text according to a topic model consisting of 24 topics; the full topic model is also available for download. The corpus is tokenized and linguistically annotated for lemma and part of speech using TreeTagger (Schmid 1994, Schmid 1995). For spelling normalization we use a trained model of VARD (Baron and Rayson 2008). As a special feature, we encode with each unit (word token) its average surprisal, i.e. the average amount of information it carries in bits, with words as units and trigrams as contexts (cf. Genzel and Charniak 2002). Release 4.0 of the corpus includes improved OCR correction and the removal of non-text tokens such as formulæ and tables.
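A minimal sketch of how such per-token and average surprisal values can be computed from a trigram model; `trigram_prob` is a hypothetical stand-in for the actual language model used to build the annotation:

```python
import math
from collections import defaultdict

def token_surprisal(tokens, trigram_prob):
    """Surprisal of each token in bits: s(w_i) = -log2 P(w_i | w_{i-2}, w_{i-1})."""
    padded = ["<s>", "<s>"] + tokens
    return [(padded[i], -math.log2(trigram_prob(padded[i - 2], padded[i - 1], padded[i])))
            for i in range(2, len(padded))]

def average_surprisal(tokens, trigram_prob):
    """Average surprisal per word (type), averaged over all its occurrences."""
    sums, counts = defaultdict(float), defaultdict(int)
    for word, s in token_surprisal(tokens, trigram_prob):
        sums[word] += s
        counts[word] += 1
    return {w: sums[w] / counts[w] for w in sums}
```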
Contact person: Elke Teich
Link to the resource
The RSC 6.0 Open is an extended version of the Royal Society Corpus (RSC), a diachronic corpus of scientific English now covering more than 300 years of scientific writing (1665-1996). It consists of approximately 78.6 million tokens and is encoded for text type (abstracts, articles), author, and year of publication. Information about decades and 50-year periods is also available, allowing for diachronic analyses at different granularities.
Contact person: Elke Teich
Link to the resource
A mixed-register corpus for the investigation of fragments, i.e. incomplete sentences, in German.
Contact person: Lisa Schäfer
Link to the resource