Publications

Ortmann, Katrin; Roussel, Adam; Dipper, Stefanie

Evaluating Off-the-Shelf NLP Tools for German Inproceedings

Proceedings of the Conference on Natural Language Processing (KONVENS), pp. 212-222, Erlangen, Germany, 2019.

It is not always easy to keep track of what tools are currently available for a particular annotation task, nor is it obvious how the provided models will perform on a given dataset. In this contribution, we provide an overview of the tools available for the automatic annotation of German-language text. We evaluate fifteen free and open source NLP tools for the linguistic annotation of German, looking at the fundamental NLP tasks of sentence segmentation, tokenization, POS tagging, morphological analysis, lemmatization, and dependency parsing. To get an idea of how the systems’ performance will generalize to various domains, we compiled our test corpus from various non-standard domains. All of the systems in our study are evaluated not only with respect to accuracy, but also the computational resources required.
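To make the evaluation setup concrete: for each task, a tool's output is compared token by token against gold-standard annotations. The following is a minimal, hypothetical sketch of such an accuracy computation for POS tagging; it is illustrative only and not taken from the paper's evaluation code linked in the record below.

def pos_accuracy(gold_tags, predicted_tags):
    """Share of tokens whose predicted POS tag matches the gold tag (assumes aligned tokenization)."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must be aligned token by token")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags) if gold_tags else 0.0

# Toy example with STTS tags for a short German sentence.
gold = ["ART", "NN", "VVFIN", "ADV", "$."]
pred = ["ART", "NN", "VVFIN", "ADJD", "$."]
print(pos_accuracy(gold, pred))  # 0.8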

@inproceedings{Ortmann2019b,
title = {Evaluating Off-the-Shelf NLP Tools for German},
author = {Katrin Ortmann and Adam Roussel and Stefanie Dipper},
url = {https://github.com/rubcompling/konvens2019},
year = {2019},
date = {2019},
booktitle = {Proceedings of the Conference on Natural Language Processing (KONVENS)},
pages = {212-222},
address = {Erlangen, Germany},
abstract = {It is not always easy to keep track of what tools are currently available for a particular annotation task, nor is it obvious how the provided models will perform on a given dataset. In this contribution, we provide an overview of the tools available for the automatic annotation of German-language text. We evaluate fifteen free and open source NLP tools for the linguistic annotation of German, looking at the fundamental NLP tasks of sentence segmentation, tokenization, POS tagging, morphological analysis, lemmatization, and dependency parsing. To get an idea of how the systems’ performance will generalize to various domains, we compiled our test corpus from various non-standard domains. All of the systems in our study are evaluated not only with respect to accuracy, but also the computational resources required.},
pubstate = {published},
type = {inproceedings}
}

Project:   C6

Jágrová, Klára; Stenger, Irina; Telus, Magdalena

Slavische Interkomprehension im 5-Sprachen-Kurs – Dokumentation eines Semesters Journal Article

Polnisch in Deutschland. Zeitschrift der Bundesvereinigung der Polnischlehrkräfte. Sondernummer: Emil Krebs und die Mehrsprachigkeit in Europa, pp. 122–133, 2019.

@article{Jágrová2019,
title = {Slavische Interkomprehension im 5-Sprachen-Kurs – Dokumentation eines Semesters},
author = {Kl{\'a}ra J{\'a}grov{\'a} and Irina Stenger and Magdalena Telus},
year = {2019},
date = {2019},
journal = {Polnisch in Deutschland. Zeitschrift der Bundesvereinigung der Polnischlehrkr{\"a}fte. Sondernummer: Emil Krebs und die Mehrsprachigkeit in Europa},
pages = {122–133},
pubstate = {published},
type = {article}
}

Project:   C4

Stenger, Irina

Zur Rolle der Orthographie in der slavischen Interkomprehension mit besonderem Fokus auf die kyrillische Schrift PhD Thesis

Saarland University, Saarbrücken, Germany, 2019, ISBN 978-3-86223-283-3.

Die slavischen Sprachen stellen einen bedeutenden indogermanischen Sprachzweig dar. Es stellt sich die Frage, inwieweit sich Sprecher verschiedener slavischer Sprachen interkomprehensiv verständigen können. Unter Interkomprehension wird die Kommunikationsfähigkeit von Sprechern verwandter Sprachen verstanden, wobei sich jeder Sprecher seiner Sprache bedient. Die vorliegende Arbeit untersucht die orthographische Verständlichkeit slavischer Sprachen mit kyrillischer Schrift im interkomprehensiven Lesen. Sechs ost- und südslavische Sprachen – Bulgarisch, Makedonisch, Russisch, Serbisch, Ukrainisch und Weißrussisch – werden im Hinblick auf orthographische Ähnlichkeiten und Unterschiede miteinander verglichen und statistisch analysiert. Der Fokus der empirischen Untersuchung liegt auf der Erkennung einzelner Kognaten mit diachronisch motivierten orthographischen Korrespondenzen in ost- und südslavischen Sprachen, ausgehend vom Russischen. Die in dieser Arbeit vorgestellten Methoden und erzielten Ergebnisse stellen einen empirischen Beitrag zur slavischen Interkomprehensionsforschung und Interkomprehensionsdidaktik dar.

@phdthesis{Stenger_diss_2019,
title = {Zur Rolle der Orthographie in der slavischen Interkomprehension mit besonderem Fokus auf die kyrillische Schrift},
author = {Irina Stenger},
year = {2019},
date = {2019},
school = {Saarland University},
address = {Saarbr{\"u}cken, Germany},
abstract = {Die slavischen Sprachen stellen einen bedeutenden indogermanischen Sprachzweig dar. Es stellt sich die Frage, inwieweit sich Sprecher verschiedener slavischer Sprachen interkomprehensiv verst{\"a}ndigen k{\"o}nnen. Unter Interkomprehension wird die Kommunikationsf{\"a}higkeit von Sprechern verwandter Sprachen verstanden, wobei sich jeder Sprecher seiner Sprache bedient. Die vorliegende Arbeit untersucht die orthographische Verst{\"a}ndlichkeit slavischer Sprachen mit kyrillischer Schrift im interkomprehensiven Lesen. Sechs ost- und s{\"u}dslavische Sprachen - Bulgarisch, Makedonisch, Russisch, Serbisch, Ukrainisch und Wei{\ss}russisch - werden im Hinblick auf orthographische {\"A}hnlichkeiten und Unterschiede miteinander verglichen und statistisch analysiert. Der Fokus der empirischen Untersuchung liegt auf der Erkennung einzelner Kognaten mit diachronisch motivierten orthographischen Korrespondenzen in ost- und s{\"u}dslavischen Sprachen, ausgehend vom Russischen. Die in dieser Arbeit vorgestellten Methoden und erzielten Ergebnisse stellen einen empirischen Beitrag zur slavischen Interkomprehensionsforschung und Interkomrepehensionsdidaktik dar.},
pubstate = {published},
type = {phdthesis}
}

Project:   C4

Stenger, Irina; Avgustinova, Tania; Belousov, Konstantin I.; Baranov, Dmitrij A.; Erofeeva, Elena V.

Interaction of linguistic and socio-cognitive factors in receptive multilingualism [Vzaimodejstvie lingvističeskich i sociokognitivnych parametrov pri receptivnom mul’tilingvisme] Inproceedings

25th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue 2019), Moscow, Russia, 2019.

@inproceedings{Stenger2019,
title = {Interaction of linguistic and socio-cognitive factors in receptive multilingualism [Vzaimodejstvie lingvisti{\v{c}}eskich i sociokognitivnych parametrov pri receptivnom mul’tilingvisme]},
author = {Irina Stenger and Tania Avgustinova and Konstantin I. Belousov and Dmitrij A. Baranov and Elena V. Erofeeva},
url = {http://www.dialog-21.ru/digest/2019/online/},
year = {2019},
date = {2019},
booktitle = {25th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue 2019)},
address = {Moscow, Russia},
pubstate = {published},
type = {inproceedings}
}

Project:   C4

Calvillo, Jesús

Connectionist language production : distributed representations and the uniform information density hypothesis PhD Thesis

Saarland University, Saarbrücken, Germany, 2019.

This dissertation approaches the task of modeling human sentence production from a connectionist point of view, and using distributed semantic representations. The main questions it tries to address are: (i) whether the distributed semantic representations defined by Frank et al. (2009) are suitable to model sentence production using artificial neural networks, (ii) the behavior and internal mechanism of a model that uses this representations and recurrent neural networks, and (iii) a mechanistic account of the Uniform Information Density Hypothesis (UID; Jaeger, 2006; Levy and Jaeger, 2007). Regarding the first point, the semantic representations of Frank et al. (2009), called situation vectors are points in a vector space where each vector contains information about the observations in which an event and a corresponding sentence are true. These representations have been successfully used to model language comprehension (e.g., Frank et al., 2009; Venhuizen et al., 2018). During the construction of these vectors, however, a dimensionality reduction process introduces some loss of information, which causes some aspects to be no longer recognizable, reducing the performance of a model that utilizes them. In order to address this issue, belief vectors are introduced, which could be regarded as an alternative way to obtain semantic representations of manageable dimensionality. These two types of representations (situation and belief vectors) are evaluated using them as input for a sentence production model that implements an extension of a Simple Recurrent Neural network (Elman, 1990). This model was tested under different conditions corresponding to different levels of systematicity, which is the ability of a model to generalize from a set of known items to a set of novel ones. Systematicity is an essential attribute that a model of sentence processing has to possess, considering that the number of sentences that can be generated for a given language is infinite, and therefore it is not feasible to memorize all possible message-sentence pairs. The results showed that the model was able to generalize with a very high performance in all test conditions, demonstrating a systematic behavior. Furthermore, the errors that it elicited were related to very similar semantic representations, reflecting the speech error literature, which states that speech errors involve elements with semantic or phonological similarity. This result further demonstrates the systematic behavior of the model, as it processes similar semantic representations in a similar way, even if they are new to the model. Regarding the second point, the sentence production model was analyzed in two different ways. First, by looking at the sentences it produces, including the errors elicited, highlighting difficulties and preferences of the model. The results revealed that the model learns the syntactic patterns of the language, reflecting its statistical nature, and that its main difficulty is related to very similar semantic representations, sometimes producing unintended sentences that are however very semantically related to the intended ones. Second, the connection weights and activation patterns of the model were also analyzed, reaching an algorithmic account of the internal processing of the model. According to this, the input semantic representation activates the words that are related to its content, giving an idea of their order by providing relatively more activation to words that are likely to appear early in the sentence. 
Then, at each time step the word that was previously produced activates syntactic and semantic constraints on the next word productions, while the context units of the recurrence preserve information through time, allowing the model to enforce long distance dependencies. We propose that these results can inform about the internal processing of models with similar architecture. Regarding the third point, an extension of the model is proposed with the goal of modeling UID. According to UID, language production is an efficient process affected by a tendency to produce linguistic units distributing the information as uniformly as possible and close to the capacity of the communication channel, given the encoding possibilities of the language, thus optimizing the amount of information that is transmitted per time unit. This extension of the model approaches UID by balancing two different production strategies: one where the model produces the word with highest probability given the semantics and the previously produced words, and another one where the model produces the word that would minimize the sentence length given the semantic representation and the previously produced words. By combining these two strategies, the model was able to produce sentences with different levels of information density and uniformity, providing a first step to model UID at the algorithmic level of analysis. In sum, the results show that the distributed semantic representations of Frank et al. (2009) can be used to model sentence production, exhibiting systematicity. Moreover, an algorithmic account of the internal behavior of the model was reached, with the potential to generalize to other models with similar architecture. Finally, a model of UID is presented, highlighting some important aspects about UID that need to be addressed in order to go from the formulation of UID at the computational level of analysis to a mechanistic account at the algorithmic level.

@phdthesis{Calvillo_diss_2019,
title = {Connectionist language production : distributed representations and the uniform information density hypothesis},
author = {Jesús Calvillo},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-279340},
doi = {https://doi.org/10.22028/D291-27934},
year = {2019},
date = {2019},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {This dissertation approaches the task of modeling human sentence production from a connectionist point of view, and using distributed semantic representations. The main questions it tries to address are: (i) whether the distributed semantic representations defined by Frank et al. (2009) are suitable to model sentence production using artificial neural networks, (ii) the behavior and internal mechanism of a model that uses this representations and recurrent neural networks, and (iii) a mechanistic account of the Uniform Information Density Hypothesis (UID; Jaeger, 2006; Levy and Jaeger, 2007). Regarding the first point, the semantic representations of Frank et al. (2009), called situation vectors are points in a vector space where each vector contains information about the observations in which an event and a corresponding sentence are true. These representations have been successfully used to model language comprehension (e.g., Frank et al., 2009; Venhuizen et al., 2018). During the construction of these vectors, however, a dimensionality reduction process introduces some loss of information, which causes some aspects to be no longer recognizable, reducing the performance of a model that utilizes them. In order to address this issue, belief vectors are introduced, which could be regarded as an alternative way to obtain semantic representations of manageable dimensionality. These two types of representations (situation and belief vectors) are evaluated using them as input for a sentence production model that implements an extension of a Simple Recurrent Neural network (Elman, 1990). This model was tested under different conditions corresponding to different levels of systematicity, which is the ability of a model to generalize from a set of known items to a set of novel ones. Systematicity is an essential attribute that a model of sentence processing has to possess, considering that the number of sentences that can be generated for a given language is infinite, and therefore it is not feasible to memorize all possible message-sentence pairs. The results showed that the model was able to generalize with a very high performance in all test conditions, demonstrating a systematic behavior. Furthermore, the errors that it elicited were related to very similar semantic representations, reflecting the speech error literature, which states that speech errors involve elements with semantic or phonological similarity. This result further demonstrates the systematic behavior of the model, as it processes similar semantic representations in a similar way, even if they are new to the model. Regarding the second point, the sentence production model was analyzed in two different ways. First, by looking at the sentences it produces, including the errors elicited, highlighting difficulties and preferences of the model. The results revealed that the model learns the syntactic patterns of the language, reflecting its statistical nature, and that its main difficulty is related to very similar semantic representations, sometimes producing unintended sentences that are however very semantically related to the intended ones. Second, the connection weights and activation patterns of the model were also analyzed, reaching an algorithmic account of the internal processing of the model. 
According to this, the input semantic representation activates the words that are related to its content, giving an idea of their order by providing relatively more activation to words that are likely to appear early in the sentence. Then, at each time step the word that was previously produced activates syntactic and semantic constraints on the next word productions, while the context units of the recurrence preserve information through time, allowing the model to enforce long distance dependencies. We propose that these results can inform about the internal processing of models with similar architecture. Regarding the third point, an extension of the model is proposed with the goal of modeling UID. According to UID, language production is an efficient process affected by a tendency to produce linguistic units distributing the information as uniformly as possible and close to the capacity of the communication channel, given the encoding possibilities of the language, thus optimizing the amount of information that is transmitted per time unit. This extension of the model approaches UID by balancing two different production strategies: one where the model produces the word with highest probability given the semantics and the previously produced words, and another one where the model produces the word that would minimize the sentence length given the semantic representation and the previously produced words. By combining these two strategies, the model was able to produce sentences with different levels of information density and uniformity, providing a first step to model UID at the algorithmic level of analysis. In sum, the results show that the distributed semantic representations of Frank et al. (2009) can be used to model sentence production, exhibiting systematicity. Moreover, an algorithmic account of the internal behavior of the model was reached, with the potential to generalize to other models with similar architecture. Finally, a model of UID is presented, highlighting some important aspects about UID that need to be addressed in order to go from the formulation of UID at the computational level of analysis to a mechanistic account at the algorithmic level.},
pubstate = {published},
type = {phdthesis}
}

Project:   C3

Jachmann, Torsten; Drenhaus, Heiner; Staudte, Maria; Crocker, Matthew W.

Influence of speakers’ gaze on situated language comprehension: Evidence from Event-Related Potentials Journal Article

Brain and cognition, 135, Elsevier, pp. 103571, 2019.

Behavioral studies have shown that speaker gaze to objects in a co-present scene can influence listeners’ sentence comprehension. To gain deeper insight into the mechanisms involved in gaze processing and integration, we conducted two ERP experiments (N = 30, Age: [18, 32] and [19, 33] respectively). Participants watched a centrally positioned face performing gaze actions aligned to utterances comparing two out of three displayed objects. They were asked to judge whether the sentence was true given the provided scene. We manipulated the second gaze cue to be either Congruent (baseline), Incongruent or Averted (Exp1)/Mutual (Exp2). When speaker gaze is used to form lexical expectations about upcoming referents, we found an attenuated N200 when phonological information confirms these expectations (Congruent). Similarly, we observed attenuated N400 amplitudes when gaze-cued expectations (Congruent) facilitate lexical retrieval. Crucially, only a violation of gaze-cued lexical expectations (Incongruent) leads to a P600 effect, suggesting the necessity to revise the mental representation of the situation. Our results support the hypothesis that gaze is utilized above and beyond simply enhancing a cued object’s prominence. Rather, gaze to objects leads to their integration into the mental representation of the situation before they are mentioned.

@article{Jachmann2019b,
title = {Influence of speakers’ gaze on situated language comprehension: Evidence from Event-Related Potentials},
author = {Torsten Jachmann and Heiner Drenhaus and Maria Staudte and Matthew W. Crocker},
url = {https://www.sciencedirect.com/science/article/pii/S0278262619300120},
doi = {https://doi.org/10.1016/j.bandc.2019.05.009},
year = {2019},
date = {2019},
journal = {Brain and cognition},
pages = {103571},
publisher = {Elsevier},
volume = {135},
abstract = {Behavioral studies have shown that speaker gaze to objects in a co-present scene can influence listeners’ sentence comprehension. To gain deeper insight into the mechanisms involved in gaze processing and integration, we conducted two ERP experiments (N = 30, Age: [18, 32] and [19, 33] respectively). Participants watched a centrally positioned face performing gaze actions aligned to utterances comparing two out of three displayed objects. They were asked to judge whether the sentence was true given the provided scene. We manipulated the second gaze cue to be either Congruent (baseline), Incongruent or Averted (Exp1)/Mutual (Exp2). When speaker gaze is used to form lexical expectations about upcoming referents, we found an attenuated N200 when phonological information confirms these expectations (Congruent). Similarly, we observed attenuated N400 amplitudes when gaze-cued expectations (Congruent) facilitate lexical retrieval. Crucially, only a violation of gaze-cued lexical expectations (Incongruent) leads to a P600 effect, suggesting the necessity to revise the mental representation of the situation. Our results support the hypothesis that gaze is utilized above and beyond simply enhancing a cued object’s prominence. Rather, gaze to objects leads to their integration into the mental representation of the situation before they are mentioned.},
pubstate = {published},
type = {article}
}

Projects:   A5 C3

Brandt, Erika

Information density and phonetic structure: explaining segmental variability PhD Thesis

Saarland University, Saarbrücken, Germany, 2019.

There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English, only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners‘ perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency, and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures and voice quality metrics. Vowel dispersion is analyzed in the context of German learners‘ speech and in a cross-linguistic study. As results, we replicate previous findings of reduced segment duration (and VOT), higher likelihood to delete, and less vowel dispersion for easily predictable segments. Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to delete, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much more effective effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. Especially, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion and magnitude in formant change. Regarding perception we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.
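The surprisal measure defined in the abstract, S(unit_i) = -log2 P(unit_i|context), can be illustrated with a small, hedged sketch. The bigram counts below are a toy example, not the large-corpus language models used in the thesis.

import math
from collections import Counter

# Sketch: surprisal S(unit_i) = -log2 P(unit_i | context), estimated here from
# plain bigram counts over a toy corpus (no smoothing).
corpus = "der hund läuft und der hund bellt".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def surprisal(word, context):
    """Return -log2 P(word | context) under the bigram estimate."""
    p = bigrams[(context, word)] / unigrams[context]
    return -math.log2(p)

print(surprisal("hund", "der"))    # 0.0 bits: fully predictable in this toy corpus
print(surprisal("läuft", "hund"))  # 1.0 bit: one of two equally likely continuations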

@phdthesis{Brandt_diss_2019,
title = {Information density and phonetic structure: explaining segmental variability},
author = {Erika Brandt},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-279181},
doi = {https://doi.org/10.22028/D291-27918},
year = {2019},
date = {2019},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {There is growing evidence that information-theoretic principles influence linguistic structures. Regarding speech several studies have found that phonetic structures lengthen in duration and strengthen in their spectral features when they are difficult to predict from their context, whereas easily predictable phonetic structures are shortened and reduced spectrally. Most of this evidence comes from studies on American English, only some studies have shown similar tendencies in Dutch, Finnish, or Russian. In this context, the Smooth Signal Redundancy hypothesis (Aylett and Turk 2004, Aylett and Turk 2006) emerged claiming that the effect of information-theoretic factors on the segmental structure is moderated through the prosodic structure. In this thesis, we investigate the impact and interaction of information density and prosodic structure on segmental variability in production analyses, mainly based on German read speech, and also listeners' perception of differences in phonetic detail caused by predictability effects. Information density (ID) is defined as contextual predictability or surprisal (S(unit_i) = -log2 P(unit_i|context)) and estimated from language models based on large text corpora. In addition to surprisal, we include word frequency, and prosodic factors, such as primary lexical stress, prosodic boundary, and articulation rate, as predictors of segmental variability in our statistical analysis. As acoustic-phonetic measures, we investigate segment duration and deletion, voice onset time (VOT), vowel dispersion, global spectral characteristics of vowels, dynamic formant measures and voice quality metrics. Vowel dispersion is analyzed in the context of German learners' speech and in a cross-linguistic study. As results, we replicate previous findings of reduced segment duration (and VOT), higher likelihood to delete, and less vowel dispersion for easily predictable segments. Easily predictable German vowels have less formant change in their vowel section length (VSL), F1 slope and velocity, are less curved in their F2, and show increased breathiness values in cepstral peak prominence (smoothed) than vowels that are difficult to predict from their context. Results for word frequency show similar tendencies: German segments in high-frequency words are shorter, more likely to delete, less dispersed, and show less magnitude in formant change, less F2 curvature, as well as less harmonic richness in open quotient smoothed than German segments in low-frequency words. These effects are found even though we control for the expected and much more effective effects of stress, boundary, and speech rate. In the cross-linguistic analysis of vowel dispersion, the effect of ID is robust across almost all of the six languages and the three intended speech rates. Surprisal does not affect vowel dispersion of non-native German speakers. Surprisal and prosodic factors interact in explaining segmental variability. Especially, stress and surprisal complement each other in their positive effect on segment duration, vowel dispersion and magnitude in formant change. Regarding perception we observe that listeners are sensitive to differences in phonetic detail stemming from high and low surprisal contexts for the same lexical target.},
pubstate = {published},
type = {phdthesis}
}

Project:   C1

Brandt, Erika; Andreeva, Bistra; Möbius, Bernd

Information density and vowel dispersion in the productions of Bulgarian L2 speakers of German Inproceedings

Proceedings of the 19th International Congress of Phonetic Sciences, pp. 3165-3169, Melbourne, Australia, 2019.

We investigated the influence of information density (ID) on vowel space size in L2. Vowel dispersion was measured for the stressed tense vowels /i:, o:, a:/ and their lax counterpart /I, O, a/ in read speech from six German speakers, six advanced and six intermediate Bulgarian speakers of German. The Euclidean distance between center of the vowel space and formant values for each speaker was used as a measure for vowel dispersion. ID was calculated as the surprisal of the triphone of the preceding context. We found a significant positive correlation between surprisal and vowel dispersion in German native speakers. The advanced L2 speakers showed a significant positive relationship between these two measures, while this was not observed in intermediate L2 vowel productions. The intermediate speakers raised their vowel space, reflecting native Bulgarian vowel raising in unstressed positions.
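As a hedged illustration of the dispersion measure described in the abstract (the Euclidean distance between a vowel token's formant values and the centre of the speaker's vowel space), with invented formant values rather than the study's measurements:

import math

# Sketch of vowel dispersion as Euclidean distance from the vowel space centre.
# (F1, F2) values below are invented for illustration only.
tokens = {"i:": (300, 2300), "o:": (400, 800), "a:": (750, 1300)}

centre = (
    sum(f1 for f1, _ in tokens.values()) / len(tokens),
    sum(f2 for _, f2 in tokens.values()) / len(tokens),
)

def dispersion(f1, f2, centre):
    """Euclidean distance of one vowel token from the vowel space centre."""
    return math.hypot(f1 - centre[0], f2 - centre[1])

for vowel, (f1, f2) in tokens.items():
    print(vowel, round(dispersion(f1, f2, centre), 1))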

@inproceedings{Brandt2019,
title = {Information density and vowel dispersion in the productions of Bulgarian L2 speakers of German},
author = {Erika Brandt and Bistra Andreeva and Bernd M{\"o}bius},
url = {https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/29548},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 19th International Congress of Phonetic Sciences},
pages = {3165-3169},
address = {Melbourne, Australia},
abstract = {We investigated the influence of information density (ID) on vowel space size in L2. Vowel dispersion was measured for the stressed tense vowels /i:, o:, a:/ and their lax counterpart /I, O, a/ in read speech from six German speakers, six advanced and six intermediate Bulgarian speakers of German. The Euclidean distance between center of the vowel space and formant values for each speaker was used as a measure for vowel dispersion. ID was calculated as the surprisal of the triphone of the preceding context. We found a significant positive correlation between surprisal and vowel dispersion in German native speakers. The advanced L2 speakers showed a significant positive relationship between these two measures, while this was not observed in intermediate L2 vowel productions. The intermediate speakers raised their vowel space, reflecting native Bulgarian vowel raising in unstressed positions.},
pubstate = {published},
type = {inproceedings}
}

Project:   C1

van Genabith, Josef; España-Bonet, Cristina; Lapshinova-Koltunski, Ekaterina

Analysing Coreference in Transformer Outputs Inproceedings

Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), Association for Computational Linguistics, pp. 1-12, Hong Kong, China, 2019.

We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

@inproceedings{lapshinovaEtal:2019iscoMT,
title = {Analysing Coreference in Transformer Outputs},
author = {Josef van Genabith and Cristina Espa{\~n}a-Bonet and Ekaterina Lapshinova-Koltunski},
url = {https://www.aclweb.org/anthology/D19-6501},
doi = {https://doi.org/10.18653/v1/D19-6501},
year = {2019},
date = {2019},
booktitle = {Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)},
pages = {1-12},
publisher = {Association for Computational Linguistics},
address = {Hong Kong, China},
abstract = {We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.},
pubstate = {published},
type = {inproceedings}
}

Project:   B6

Biswas, Rajarshi; Mogadala, Aditya; Barz, Michael; Sonntag, Daniel; Klakow, Dietrich

Automatic Judgement of Neural Network-Generated Image Captions Inproceedings

7th International Conference on Statistical Language and Speech Processing (SLSP2019), 11816, Ljubljana, Slovenia, 2019.

Manual evaluation of individual results of natural language generation tasks is one of the bottlenecks. It is very time consuming and expensive if it is, for example, crowdsourced. In this work, we address this problem for the specific task of automatic image captioning. We automatically generate human-like judgements on grammatical correctness, image relevance and diversity of the captions obtained from a neural image caption generator. For this purpose, we use pool-based active learning with uncertainty sampling and represent the captions using fixed size vectors from Google’s Universal Sentence Encoder. In addition, we test common metrics, such as BLEU, ROUGE, METEOR, Levenshtein distance, and n-gram counts and report F1 score for the classifiers used under the active learning scheme for this task. To the best of our knowledge, our work is the first in this direction and promises to reduce time, cost, and human effort.
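A minimal sketch of the pool-based active learning loop with uncertainty sampling that the abstract refers to; the feature vectors, the logistic-regression judge, and the oracle labels below are stand-ins (the paper uses Universal Sentence Encoder caption vectors and human judgements):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Pool-based active learning with uncertainty sampling (illustrative stand-in data).
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 16))              # stand-in for fixed-size caption vectors
y_pool = (X_pool[:, 0] > 0).astype(int)          # stand-in for human judgements

# Seed set with both classes represented.
labelled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
unlabelled = [i for i in range(len(X_pool)) if i not in labelled]

for _ in range(5):                               # a few querying rounds
    clf = LogisticRegression().fit(X_pool[labelled], y_pool[labelled])
    probs = clf.predict_proba(X_pool[unlabelled])
    # Uncertainty sampling: query the instance the classifier is least certain about.
    query = unlabelled[int(np.argmin(probs.max(axis=1)))]
    labelled.append(query)                       # in practice: ask a human annotator
    unlabelled.remove(query)

print(len(labelled), "captions labelled after querying")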

@inproceedings{Biswas2019,
title = {Automatic Judgement of Neural Network-Generated Image Captions},
author = {Rajarshi Biswas and Aditya Mogadala and Michael Barz and Daniel Sonntag and Dietrich Klakow},
url = {https://link.springer.com/chapter/10.1007/978-3-030-31372-2_22},
year = {2019},
date = {2019},
booktitle = {7th International Conference on Statistical Language and Speech Processing (SLSP2019)},
address = {Ljubljana, Slovenia},
abstract = {Manual evaluation of individual results of natural language generation tasks is one of the bottlenecks. It is very time consuming and expensive if it is, for example, crowdsourced. In this work, we address this problem for the specific task of automatic image captioning. We automatically generate human-like judgements on grammatical correctness, image relevance and diversity of the captions obtained from a neural image caption generator. For this purpose, we use pool-based active learning with uncertainty sampling and represent the captions using fixed size vectors from Google’s Universal Sentence Encoder. In addition, we test common metrics, such as BLEU, ROUGE, METEOR, Levenshtein distance, and n-gram counts and report F1 score for the classifiers used under the active learning scheme for this task. To the best of our knowledge, our work is the first in this direction and promises to reduce time, cost, and human effort.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Lange, Lukas; Hedderich, Michael; Klakow, Dietrich

Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels Inproceedings

Inui, Kentaro; Jiang, Jing; Ng, Vincent; Wan, Xiaojun (Ed.): Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, pp. 3552-3557, Hong Kong, China, 2019.

In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous works have shown that significant improvements can be reached by injecting information about the confusion between clean and noisy labels in this additional training data into the classifier training. However, for noise estimation, these approaches either do not take the input features (in our case word embeddings) into account, or they need to learn the noise modeling from scratch which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute different confusion matrices for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix based methods by up to 9%.
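A hedged sketch of the central idea, clustering tokens by their input features and estimating one clean-to-noisy label confusion matrix per cluster; the embeddings and labels below are synthetic, not the paper's data or implementation:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(1000, 50))               # stand-in word embeddings
clean = rng.integers(0, 3, size=1000)                  # gold label ids on a small clean set
noisy = np.where(rng.random(1000) < 0.8, clean,        # distantly supervised labels, ~20% corrupted
                 rng.integers(0, 3, size=1000))

n_labels, n_clusters = 3, 5
clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# One confusion matrix P(noisy label | clean label) per embedding cluster.
confusion = np.zeros((n_clusters, n_labels, n_labels))
for c in range(n_clusters):
    for g, n in zip(clean[clusters == c], noisy[clusters == c]):
        confusion[c, g, n] += 1
    confusion[c] /= np.maximum(confusion[c].sum(axis=1, keepdims=True), 1)

print(confusion[0].round(2))                           # noise estimate for cluster 0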

@inproceedings{lange-etal-2019-feature,
title = {Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels},
author = {Lukas Lange and Michael Hedderich and Dietrich Klakow},
editor = {Kentaro Inui and Jing Jiang and Vincent Ng and Xiaojun Wan},
url = {https://aclanthology.org/D19-1362/},
doi = {https://doi.org/10.18653/v1/D19-1362},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages = {3552-3557},
publisher = {Association for Computational Linguistics},
address = {Hong Kong, China},
abstract = {In low-resource settings, the performance of supervised labeling models can be improved with automatically annotated or distantly supervised data, which is cheap to create but often noisy. Previous works have shown that significant improvements can be reached by injecting information about the confusion between clean and noisy labels in this additional training data into the classifier training. However, for noise estimation, these approaches either do not take the input features (in our case word embeddings) into account, or they need to learn the noise modeling from scratch which can be difficult in a low-resource setting. We propose to cluster the training data using the input features and then compute different confusion matrices for each cluster. To the best of our knowledge, our approach is the first to leverage feature-dependent noise modeling with pre-initialized confusion matrices. We evaluate on low-resource named entity recognition settings in several languages, showing that our methods improve upon other confusion-matrix based methods by up to 9%.},
pubstate = {published},
type = {inproceedings}
}

Project:   B4

Reich, Ingo

Saulecker und supergemütlich! Pilotstudien zur fragmentarischen Verwendung expressiver Adjektive. Incollection

d'Avis, Franz; Finkbeiner, Rita (Ed.): Expressivität im Deutschen, De Gruyter, pp. 109-128, Berlin, Boston, 2019.

Schaut man auf dem Kika die „Jungs-WG“ oder „Durch die Wildnis“, dann ist gefühlt jede dritte Äußerung eine isolierte Verwendung eines expressiven Adjektivs der Art „Mega!. Ausgehend von dieser ersten impressionistischen Beobachtung wird in diesem Artikel sowohl korpuslinguistisch wie auch experimentell der Hypothese nachgegangen, dass expressive Adjektive in fragmentarischer Verwendung signifikant akzeptabler sind als deskriptive Adjektive. Während sich diese Hypothese im Korpus zunächst weitgehend bestätigt, zeigen die experimentellen Untersuchungen zwar, dass expressive Äußerungen generell besser bewertet werden als deskriptive Äußerungen, die ursprüngliche Hypothese lässt sich aber nicht bestätigen. Die Diskrepanz zwischen den korpuslinguistischen und den experimentellen Ergebnissen wird in der Folge auf eine Unterscheidung zwischen individuenbezogenen und äußerungsbezogenen (expressiven) Adjektiven zurückgeführt und festgestellt, dass die Korpusergebnisse die Verteilung äußerungsbezogener expressiver Adjektive nachzeichnen, während sich die Experimente alleine auf individuenbezogene (expressive) Adjektive beziehen. Die ursprüngliche Hypothese wäre daher in dem Sinne zu qualifizieren, dass sie nur Aussagen über die isolierte Verwendung äußerungsbezogener Adjektive macht.

@incollection{Reich2019,
title = {Saulecker und supergem{\"u}tlich! Pilotstudien zur fragmentarischen Verwendung expressiver Adjektive.},
author = {Ingo Reich},
editor = {Franz d'Avis and Rita Finkbeiner},
url = {https://www.degruyter.com/document/doi/10.1515/9783110630190-005/html},
doi = {https://doi.org/10.1515/9783110630190-005},
year = {2019},
date = {2019},
booktitle = {Expressivit{\"a}t im Deutschen},
pages = {109-128},
publisher = {De Gruyter},
address = {Berlin, Boston},
abstract = {Schaut man auf dem Kika die „Jungs-WG“ oder „Durch die Wildnis“, dann ist gef{\"u}hlt jede dritte {\"A}u{\ss}erung eine isolierte Verwendung eines expressiven Adjektivs der Art „Mega!. Ausgehend von dieser ersten impressionistischen Beobachtung wird in diesem Artikel sowohl korpuslinguistisch wie auch experimentell der Hypothese nachgegangen, dass expressive Adjektive in fragmentarischer Verwendung signifikant akzeptabler sind als deskriptive Adjektive. W{\"a}hrend sich diese Hypothese im Korpus zun{\"a}chst weitgehend best{\"a}tigt, zeigen die experimentellen Untersuchungen zwar, dass expressive {\"A}u{\ss}erungen generell besser bewertet werden als deskriptive {\"A}u{\ss}erungen, die urspr{\"u}ngliche Hypothese l{\"a}sst sich aber nicht best{\"a}tigen. Die Diskrepanz zwischen den korpuslinguistischen und den experimentellen Ergebnissen wird in der Folge auf eine Unterscheidung zwischen individuenbezogenen und {\"a}u{\ss}erungsbezogenen (expressiven) Adjektiven zur{\"u}ckgef{\"u}hrt und festgestellt, dass die Korpusergebnisse die Verteilung {\"a}u{\ss}erungsbezogener expressiver Adjektive nachzeichnen, w{\"a}hrend sich die Experimente alleine auf individuenbezogene (expressive) Adjektive beziehen. Die urspr{\"u}ngliche Hypothese w{\"a}re daher in dem Sinne zu qualifizieren, dass sie nur Aussagen {\"u}ber die isolierte Verwendung {\"a}u{\ss}erungsbezogener Adjektive macht.},
pubstate = {published},
type = {incollection}
}

Project:   B3

Scholman, Merel

Coherence relations in discourse and cognition: comparing approaches, annotations, and interpretations PhD Thesis

Saarland University, Saarbrücken, Germany, 2019.

When readers comprehend a discourse, they do not merely interpret each clause or sentence separately; rather, they assign meaning to the text by creating semantic links between the clauses and sentences. These links are known as coherence relations (cf. Hobbs, 1979; Sanders, Spooren & Noordman, 1992). If readers are not able to construct such relations between the clauses and sentences of a text, they will fail to fully understand that text. Discourse coherence is therefore crucial to natural language comprehension in general. Most frameworks that propose inventories of coherence relation types agree on the existence of certain coarse-grained relation types, such as causal relations (relations types belonging to the causal class include Cause or Result relations), and additive relations (e.g., Conjunctions or Specifications). However, researchers often disagree on which finer-grained relation types hold and, as a result, there is no uniform set of relations that the community has agreed on (Hovy & Maier, 1995). Using a combination of corpus-based studies and off-line and on-line experimental methods, the studies reported in this dissertation examine distinctions between types of relations. The studies are based on the argument that coherence relations are cognitive entities, and distinctions of coherence relation types should therefore be validated using observations that speak to both the descriptive adequacy and the cognitive plausibility of the distinctions. Various distinctions between relation types are investigated on several levels, corresponding to the central challenges of the thesis. First, the distinctions that are made in approaches to coherence relations are analysed by comparing the relational classes and assessing the theoretical correspondences between the proposals. An interlingua is developed that can be used to map relational labels from one approach to another, therefore improving the interoperability between the different approaches. Second, practical correspondences between different approaches are studied by evaluating datasets containing coherence relation annotations from multiple approaches. A comparison of the annotations from different approaches on the same data corroborate the interlingua, but also reveal systematic patterns of discrepancies between the frameworks that are caused by different operationalizations. Finally, in the experimental part of the dissertation, readers’ interpretations are investigated to determine whether readers are able to distinguish between specific types of relations that cause the discrepancies between approaches. Results from off-line and online studies provide insight into readers’ interpretations of multi-interpretable relations, individual differences in interpretations, anticipation of discourse structure, and distributional differences between languages on readers’ processing of discourse. In sum, the studies reported in this dissertation contribute to a more detailed understanding of which types of relations comprehenders construct and how these relations are inferred and processed.

@phdthesis{Scholman_diss_2019,
title = {Coherence relations in discourse and cognition: comparing approaches, annotations, and interpretations},
author = {Merel Scholman},
url = {http://nbn-resolving.de/urn:nbn:de:bsz:291--ds-278687},
doi = {https://doi.org/10.22028/D291-27868},
year = {2019},
date = {2019},
school = {Saarland University},
address = {Saarbruecken, Germany},
abstract = {When readers comprehend a discourse, they do not merely interpret each clause or sentence separately; rather, they assign meaning to the text by creating semantic links between the clauses and sentences. These links are known as coherence relations (cf. Hobbs, 1979; Sanders, Spooren & Noordman, 1992). If readers are not able to construct such relations between the clauses and sentences of a text, they will fail to fully understand that text. Discourse coherence is therefore crucial to natural language comprehension in general. Most frameworks that propose inventories of coherence relation types agree on the existence of certain coarse-grained relation types, such as causal relations (relations types belonging to the causal class include Cause or Result relations), and additive relations (e.g., Conjunctions or Specifications). However, researchers often disagree on which finer-grained relation types hold and, as a result, there is no uniform set of relations that the community has agreed on (Hovy & Maier, 1995). Using a combination of corpus-based studies and off-line and on-line experimental methods, the studies reported in this dissertation examine distinctions between types of relations. The studies are based on the argument that coherence relations are cognitive entities, and distinctions of coherence relation types should therefore be validated using observations that speak to both the descriptive adequacy and the cognitive plausibility of the distinctions. Various distinctions between relation types are investigated on several levels, corresponding to the central challenges of the thesis. First, the distinctions that are made in approaches to coherence relations are analysed by comparing the relational classes and assessing the theoretical correspondences between the proposals. An interlingua is developed that can be used to map relational labels from one approach to another, therefore improving the interoperability between the different approaches. Second, practical correspondences between different approaches are studied by evaluating datasets containing coherence relation annotations from multiple approaches. A comparison of the annotations from different approaches on the same data corroborate the interlingua, but also reveal systematic patterns of discrepancies between the frameworks that are caused by different operationalizations. Finally, in the experimental part of the dissertation, readers’ interpretations are investigated to determine whether readers are able to distinguish between specific types of relations that cause the discrepancies between approaches. Results from off-line and online studies provide insight into readers’ interpretations of multi-interpretable relations, individual differences in interpretations, anticipation of discourse structure, and distributional differences between languages on readers’ processing of discourse. In sum, the studies reported in this dissertation contribute to a more detailed understanding of which types of relations comprehenders construct and how these relations are inferred and processed.},
pubstate = {published},
type = {phdthesis}
}

Project:   B2

Juzek, Tom; Fischer, Stefan; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke

Challenges of parsing a historical corpus of Scientific English Miscellaneous

Historical Corpora and Variation (Book of Abstracts), Cagliari, Italy, 2019.

In this contribution, we outline our experiences with syntactically parsing a diachronic historical corpus. We report on how errors like OCR inaccuracies, end-of-sentence inaccuracies, etc. propagate bottom-up and how we approach such errors by building on existing machine learning approaches for error correction. The Royal Society Corpus (RSC; Kermes et al. 2016) is a collection of scientific text from 1665 to 1869 and contains ca. 10 000 documents and 30 million tokens. Using the RSC, we wish to describe and model how syntactic complexity changes as Scientific English of the late modern period develops. Our focus is on how common measures of syntactic complexity, e.g. length in tokens, embedding depth, and number of dependants, relate to estimates of information content. Our hypothesis is that Scientific English develops towards the use of shorter sentences with fewer clausal embeddings and increasingly complex noun phrases over time, in order to accommodate an expansion on the lexical level.
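As a rough sketch of the complexity measures mentioned above (sentence length, embedding depth, number of dependants), computed here from a dependency parse given as a list of head indices; the example parse is invented and this is not the project's actual pipeline:

def complexity(heads):
    """Sentence length, maximum embedding depth and maximum number of dependants
    for a dependency parse given as 1-based head indices (0 = root)."""
    length = len(heads)

    def depth(i):                                  # distance of token i from the root
        d = 0
        while heads[i - 1] != 0:
            i = heads[i - 1]
            d += 1
        return d

    max_depth = max(depth(i) for i in range(1, length + 1))
    max_dependants = max(sum(1 for h in heads if h == i) for i in range(1, length + 1))
    return length, max_depth, max_dependants

# "The society published the results": tokens 1-5, 'published' (token 3) is the root.
heads = [2, 3, 0, 5, 3]
print(complexity(heads))                           # (5, 2, 2)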

@miscellaneous{Juzek2019a,
title = {Challenges of parsing a historical corpus of Scientific English},
author = {Tom Juzek and Stefan Fischer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://convegni.unica.it/hicov/files/2019/01/Juzek-et-al.pdf},
year = {2019},
date = {2019},
booktitle = {Historical Corpora and Variation (Book of Abstracts)},
address = {Cagliari, Italy},
abstract = {In this contribution, we outline our experiences with syntactically parsing a diachronic historical corpus. We report on how errors like OCR inaccuracies, end-of-sentence inaccuracies, etc. propagate bottom-up and how we approach such errors by building on existing machine learning approaches for error correction. The Royal Society Corpus (RSC; Kermes et al. 2016) is a collection of scientific text from 1665 to 1869 and contains ca. 10 000 documents and 30 million tokens. Using the RSC, we wish to describe and model how syntactic complexity changes as Scientific English of the late modern period develops. Our focus is on how common measures of syntactic complexity, e.g. length in tokens, embedding depth, and number of dependants, relate to estimates of information content. Our hypothesis is that Scientific English develops towards the use of shorter sentences with fewer clausal embeddings and increasingly complex noun phrases over time, in order to accommodate an expansion on the lexical level.},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Juzek, Tom; Fischer, Stefan; Krielke, Marie-Pauline; Degaetano-Ortlieb, Stefania; Teich, Elke

Annotation quality assessment and error correction in diachronic corpora: Combining pattern-based and machine learning approaches Miscellaneous

52nd Annual Meeting of the Societas Linguistica Europaea (Book of Abstracts), 2019.

@miscellaneous{Juzek2019,
title = {Annotation quality assessment and error correction in diachronic corpora: Combining pattern-based and machine learning approaches},
author = {Tom Juzek and Stefan Fischer and Marie-Pauline Krielke and Stefania Degaetano-Ortlieb and Elke Teich},
year = {2019},
date = {2019},
booktitle = {52nd Annual Meeting of the Societas Linguistica Europaea (Book of Abstracts)},
pubstate = {published},
type = {miscellaneous}
}

Project:   B1

Degaetano-Ortlieb, Stefania; Menzel, Katrin; Teich, Elke

Typical linguistic patterns of English history texts from the eighteenth to the nineteenth century Book Chapter

Moskowich, Isabel; Crespo, Begoña; Puente-Castelo, Luis; Maria Monaco, Leida (Ed.): Writing History in Late Modern English: Explorations of the Coruña Corpus, John Benjamins, pp. 58-81, Amsterdam, 2019.

@inbook{Degaetano-Ortlieb2019b,
title = {Typical linguistic patterns of English history texts from the eighteenth to the nineteenth century},
author = {Stefania Degaetano-Ortlieb and Katrin Menzel and Elke Teich},
editor = {Isabel Moskowich and Bego{\~n}a Crespo and Luis Puente-Castelo and Leida Maria Monaco},
url = {https://benjamins.com/catalog/z.225.04deg},
year = {2019},
date = {2019},
booktitle = {Writing History in Late Modern English: Explorations of the Coru{\~n}a Corpus},
pages = {58-81},
publisher = {John Benjamins},
address = {Amsterdam},
pubstate = {published},
type = {inbook}
}

Project:   B1

Krielke, Marie-Pauline; Fischer, Stefan; Degaetano-Ortlieb, Stefania; Teich, Elke

System and use of wh-relativizers in 200 years of English scientific writing Miscellaneous

10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

We investigate the diachronic development of wh-relativizers in English scientific writing in the late modern period, characterized by an initially richly populated paradigm in the late 17th/early 18th century and a reduction to only a few options by the mid 19th century. To explain this reduction, we take the perspective of rational communication, according to which language users, while striving for successful communication, seek to reduce their effort. Previous work has shown that production effort is directly linked to the number of options at a given choice point (Milin et al. 2009, Linzen and Jaeger 2016). This effort is appropriately indexed by entropy: The more options with equal/similar probability, the higher the entropy, i.e. the higher the production effort. Similarly, processing effort is correlated with predictability in context – surprisal (Levy 2008). Highly predictable, conventionalized patterns are easier to produce and comprehend than less predictable ones. Assuming that language users strive for ease in communication, diachronically they are likely to (a) develop a preference for which options to use and discard others to reduce entropy, and (b) converge on how to use those options to reduce surprisal. We test this for the changing use of wh-relativizers in scientific text in the late modern period. Many scholars have investigated variation in relativizer choice in standard spoken and written varieties (e.g. Guy and Bayley 1995; Biber et al. 1999; Lehmann 2001; Hinrichs et al. 2015), in vernacular speech (e.g. Romaine 1982, Tottie and Harvie 2000; Tagliamonte 2002; Tagliamonte et al. 2005; Levey 2006), and from synchronic and diachronic perspectives (e.g. Romaine 1980; Ball 1996; Hundt et al. 2012; Nevalainen 2012, Nevalainen and Raumolin-Brunberg 2002). While stylistic variability of the different options in written present day English is well known (see Biber et al. 1999; Leech et al. 2009), we know little about the diachronic development of relativizers according to register, e.g. in scientific writing. Also, most research only considers most common relativizers (e.g. which, that, zero) still in use in present day English. Here, we study a more comprehensive set of relativizers across scientific and “general language” (mix of registers) from a diachronic perspective. Possible paradigmatic change is analyzed by diachronic word embeddings (cf. Fankhauser and Kupietz 2017), allowing us to select items affected by change. Then we assess the change (reduction/expansion) of a paradigm estimating its entropy over time. To check whether changes are specific to scientific language, we compare with uses in general language. Finally, we inspect possible changes in the predictability of selected wh-relativizers involved in paradigmatic change estimating their surprisal over time, looking for traces of conventionalization (cf. Degaetano-Ortlieb and Teich 2016, 2018).
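The abstract operationalizes production effort as the entropy over the options in the relativizer paradigm. As a purely illustrative sketch of that measure (the function name and the probabilities below are made up for this page and are not taken from the paper's corpora), paradigm entropy can be computed as follows:

import math

def paradigm_entropy(probs):
    # Shannon entropy (in bits) over the relative frequencies of the
    # options in a paradigm, e.g. relativizer choices at one period.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical relative frequencies, for illustration only:
# a richly populated paradigm (early 18th c.) vs. a reduced one (mid 19th c.)
early = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05]
late = [0.70, 0.25, 0.05]

print(paradigm_entropy(early))  # ≈ 2.37 bits: many comparably likely options
print(paradigm_entropy(late))   # ≈ 1.08 bits: the paradigm has contracted

A drop in this value over time is what the abstract describes as a reduction of the paradigm and hence of production effort.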

@miscellaneous{Krielke2019b,
title = {System and use of wh-relativizers in 200 years of English scientific writing},
author = {Marie-Pauline Krielke and Stefan Fischer and Stefania Degaetano-Ortlieb and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/cl2019_paper_266.pdf},
year = {2019},
date = {2019},
booktitle = {10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {We investigate the diachronic development of wh-relativizers in English scientific writing in the late modern period, characterized by an initially richly populated paradigm in the late 17th/early 18th century and a reduction to only a few options by the mid 19th century. To explain this reduction, we take the perspective of rational communication, according to which language users, while striving for successful communication, seek to reduce their effort. Previous work has shown that production effort is directly linked to the number of options at a given choice point (Milin et al. 2009, Linzen and Jaeger 2016). This effort is appropriately indexed by entropy: The more options with equal/similar probability, the higher the entropy, i.e. the higher the production effort. Similarly, processing effort is correlated with predictability in context – surprisal (Levy 2008). Highly predictable, conventionalized patterns are easier to produce and comprehend than less predictable ones. Assuming that language users strive for ease in communication, diachronically they are likely to (a) develop a preference for which options to use and discard others to reduce entropy, and (b) converge on how to use those options to reduce surprisal. We test this for the changing use of wh-relativizers in scientific text in the late modern period. Many scholars have investigated variation in relativizer choice in standard spoken and written varieties (e.g. Guy and Bayley 1995; Biber et al. 1999; Lehmann 2001; Hinrichs et al. 2015), in vernacular speech (e.g. Romaine 1982, Tottie and Harvie 2000; Tagliamonte 2002; Tagliamonte et al. 2005; Levey 2006), and from synchronic and diachronic perspectives (e.g. Romaine 1980; Ball 1996; Hundt et al. 2012; Nevalainen 2012, Nevalainen and Raumolin-Brunberg 2002). While stylistic variability of the different options in written present day English is well known (see Biber et al. 1999; Leech et al. 2009), we know little about the diachronic development of relativizers according to register, e.g. in scientific writing. Also, most research only considers most common relativizers (e.g. which, that, zero) still in use in present day English. Here, we study a more comprehensive set of relativizers across scientific and “general language” (mix of registers) from a diachronic perspective. Possible paradigmatic change is analyzed by diachronic word embeddings (cf. Fankhauser and Kupietz 2017), allowing us to select items affected by change. Then we assess the change (reduction/expansion) of a paradigm estimating its entropy over time. To check whether changes are specific to scientific language, we compare with uses in general language. Finally, we inspect possible changes in the predictability of selected wh-relativizers involved in paradigmatic change estimating their surprisal over time, looking for traces of conventionalization (cf. Degaetano-Ortlieb and Teich 2016, 2018).},
pubstate = {published},
type = {miscellaneous}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Krielke, Marie-Pauline; Scheurer, Franziska; Teich, Elke

A diachronic perspective on efficiency in language use: that-complement clause in academic writing across 300 years Inproceedings

Proceedings of the 10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

Efficiency in language use and the role of predictability in context have attracted many researchers from different fields (Zipf 1949; Landau 1969; Fidelholtz 1975, Jurafsky et al. 1998; Bybee and Scheibman 1999; Genzel and Charniak 2002; Aylett and Turk 2004; Hawkins 2004; Piantadosi et al. 2009, Jaeger 2010). The analysis of reduction processes, where linguistic units are reduced/omitted has enhanced our knowledge on efficiency in communication. Possible factors affecting retention or omission of an optional element include discourse context (cf. Thompson and Mulac 1991), the amount of information a unit transmits given its context (known as surprisal, cf. Jaeger 2010) or the complexity of the syntagmatic environment (Rohdenburg 1998). So far, the role change in language use plays has been less considered.
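For reference, the surprisal measure the abstract appeals to (cf. Levy 2008; Jaeger 2010) is standardly defined as the negative log-probability of a unit given its preceding context; the formula below is the textbook definition, not a quotation from the paper:

S(u_i) = -\log_2 P(u_i \mid u_1, \ldots, u_{i-1})

Under this view, an optional element such as the complementizer that is more likely to be omitted where the following material is highly predictable (low surprisal) in its context.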

@inproceedings{Degaetano-Ortlieb2019-thatcomp,
title = {A diachronic perspective on efficiency in language use: that-complement clause in academic writing across 300 years},
author = {Stefania Degaetano-Ortlieb and Marie-Pauline Krielke and Franziska Scheurer and Elke Teich},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/abstract_that-comp_final.pdf},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {Efficiency in language use and the role of predictability in context have attracted many researchers from different fields (Zipf 1949; Landau 1969; Fidelholtz 1975, Jurafsky et al. 1998; Bybee and Scheibman 1999; Genzel and Charniak 2002; Aylett and Turk 2004; Hawkins 2004; Piantadosi et al. 2009, Jaeger 2010). The analysis of reduction processes, where linguistic units are reduced/omitted has enhanced our knowledge on efficiency in communication. Possible factors affecting retention or omission of an optional element include discourse context (cf. Thompson and Mulac 1991), the amount of information a unit transmits given its context (known as surprisal, cf. Jaeger 2010) or the complexity of the syntagmatic environment (Rohdenburg 1998). So far, the role change in language use plays has been less considered.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania

Hybridization effects in literary texts Inproceedings

Proceedings of the 10th International Corpus Linguistics Conference, Cardiff, Wales, UK, 2019.

We present an analysis of subregisters, whose differentiation is still a difficult task due to their hybridity reflected in conforming to a presumed “norm” and encompassing something “new”. We focus on texts at the interface between what Halliday (2002: 177) calls two opposite “cultures”, literature and science (here: science fiction texts). Texts belonging to one register will exhibit similar choices of lexico-grammatical features. Hybrid texts at the intersection between two registers will reflect a mixture of particular features (cf. Degaetano-Ortlieb et al. 2014, Biber et al. 2015, Teich et al. 2013, 2016, Underwood 2016). Consider example (1) taken from Mary Shelley’s Frankenstein. While traditionally grounded as a literary text, it shows a registerial nuance from the influential register of science. This encompasses phrases (bold) also found in scientific articles from that period (e.g. in the Royal Society Corpus, cf. Kermes et al. 2016), verbs related to scientific endeavor (e.g. become acquainted, examine, observe, discover), and scientific terminology (e.g. anatomy, decay, corruption, vertebrae, inflammable air) packed into complex nominal phrases (underlined). Note that features marking this registerial nuance include not only lexical but also grammatical features.

(1) I became acquainted with the science of anatomy, but this was not sufficient; I must also observe the natural decay and corruption of the human body. […] Now I was led to examine the cause and progress of this decay. I succeeded in discovering the cause of generation and life. (Frankenstein, Mary Shelley, 1818/1823).

Thus, we hypothesize that hybrid registers while mainly resembling their traditional register in the use of lexico-grammatical features (H1 register resemblance), will also show particular lexico-grammatical nuances of their influential register (H2 registerial nuance). In particular, we are interested in (a) variation across registers to see which lexico-grammatical features are involved in hybridization effects and (b) intra-textual variation (e.g. across chapters) to analyze in which parts of a text hybridization effects are most prominent.

@inproceedings{Degaetano-Ortlieb2019-hybrid,
title = {Hybridization effects in literary texts},
author = {Stefania Degaetano-Ortlieb},
url = {https://stefaniadegaetano.files.wordpress.com/2019/05/abstact_cl2019_hybridization_final.pdf},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
address = {Cardiff, Wales, UK},
abstract = {We present an analysis of subregisters, whose differentiation is still a difficult task due to their hybridity reflected in conforming to a presumed “norm” and encompassing something “new”. We focus on texts at the interface between what Halliday (2002: 177) calls two opposite “cultures”, literature and science (here: science fiction texts). Texts belonging to one register will exhibit similar choices of lexico-grammatical features. Hybrid texts at the intersection between two registers will reflect a mixture of particular features (cf. Degaetano-Ortlieb et al. 2014, Biber et al. 2015, Teich et al. 2013, 2016, Underwood 2016). Consider example (1) taken from Mary Shelley’s Frankenstein. While traditionally grounded as a literary text, it shows a registerial nuance from the influential register of science. This encompasses phrases (bold) also found in scientific articles from that period (e.g. in the Royal Society Corpus, cf. Kermes et al. 2016), verbs related to scientific endeavor (e.g. become acquainted, examine, observe, discover), and scientific terminology (e.g. anatomy, decay, corruption, vertebrae, inflammable air) packed into complex nominal phrases (underlined). Note that features marking this registerial nuance include not only lexical but also grammatical features. (1) I became acquainted with the science of anatomy, but this was not sufficient; I must also observe the natural decay and corruption of the human body. […] Now I was led to examine the cause and progress of this decay. I succeeded in discovering the cause of generation and life. (Frankenstein, Mary Shelley, 1818/1823). Thus, we hypothesize that hybrid registers while mainly resembling their traditional register in the use of lexico-grammatical features (H1 register resemblance), will also show particular lexico-grammatical nuances of their influential register (H2 registerial nuance). In particular, we are interested in (a) variation across registers to see which lexico-grammatical features are involved in hybridization effects and (b) intra-textual variation (e.g. across chapters) to analyze in which parts of a text hybridization effects are most prominent.},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1

Degaetano-Ortlieb, Stefania; Piper, Andrew

The Scientization of Literary Study Inproceedings

Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at NAACL 2019, Association for Computational Linguistics, pp. 18-28, Minneapolis, MN, USA, 2019.

Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).
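The abstract does not spell out which information-theoretic measure is used; a common choice for comparing word distributions across corpora in this line of work is relative entropy (Kullback–Leibler divergence). The sketch below is a minimal, hypothetical illustration with toy data, not the paper's implementation:

import math
from collections import Counter

def kl_divergence(p_counts, q_counts, smoothing=1e-10):
    # Relative entropy D(P||Q) in bits between two unigram distributions,
    # with tiny additive smoothing so that q(w) is never zero.
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    d = 0.0
    for w in vocab:
        p = p_counts.get(w, 0) / p_total
        if p == 0:
            continue
        q = (q_counts.get(w, 0) + smoothing) / q_total
        d += p * math.log2(p / q)
    return d

# Toy corpora (illustrative only, not the corpora used in the paper)
literary = Counter("the reading of the novel suggests a method of analysis".split())
scientific = Counter("the experiment suggests a method of measurement and analysis".split())
print(kl_divergence(literary, scientific))

A larger divergence indicates that the first corpus is harder to encode with the word distribution of the second, i.e. that the two varieties are linguistically further apart.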

@inproceedings{degaetano-ortlieb-piper-2019-scientization,
title = {The Scientization of Literary Study},
author = {Stefania Degaetano-Ortlieb and Andrew Piper},
url = {https://aclanthology.org/W19-2503},
doi = {https://doi.org/10.18653/v1/W19-2503},
year = {2019},
date = {2019},
booktitle = {Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature at NAACL 2019},
pages = {18-28},
publisher = {Association for Computational Linguistics},
address = {Minneapolis, MN, USA},
abstract = {Scholarly practices within the humanities have historically been perceived as distinct from the natural sciences. We look at literary studies, a discipline strongly anchored in the humanities, and hypothesize that over the past half-century literary studies has instead undergone a process of “scientization”, adopting linguistic behavior similar to the sciences. We test this using methods based on information theory, comparing a corpus of literary studies articles (around 63,400) with a corpus of standard English and scientific English respectively. We show evidence for “scientization” effects in literary studies, though at a more muted level than scientific English, suggesting that literary studies occupies a middle ground with respect to standard English in the larger space of academic disciplines. More generally, our methodology can be applied to investigate the social positioning and development of language use across different domains (e.g. scientific disciplines, language varieties, registers).},
pubstate = {published},
type = {inproceedings}
}

Copy BibTeX to Clipboard

Project:   B1
