MERGEmultidim: recent developments for a sequential and largely information-theoretic approach to multi-'word' units - Speaker: Stefan Th. Gries

In this talk, I will discuss ongoing work on developing, refining, and exploring an algorithm to detect multi-word units (MWUs). The current implementation of this algorithm differs from much previous work in several ways:

− it is iterative: it considers potentially all (currently only contiguous) bigrams of a corpus and merges the bigram scoring highest in 'MWU-ness';
− it is highly multidimensional: the MWU-ness scores are derived from up to eight quantitative corpus-linguistic dimensions that correspond to, or ideally measure, different cognitive/psycholinguistic notions: (i) frequency of occurrence, (ii) dispersion of occurrence, (iii)-(iv) association strength (two directions), (v)-(vi) type frequencies (two directions), and (vii)-(viii) entropy-upon-removal (two directions);
− it tries to measure the different dimensions as orthogonally as possible and is largely information-theoretic: scoring relies on frequencies or versions of the Kullback-Leibler divergence.
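The iterative merge step described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation (which is in R/Julia and combines all eight dimensions): it scores contiguous bigrams with plain pointwise mutual information as a stand-in for the full multidimensional, KL-divergence-based MWU-ness score, and merges the top-scoring bigram into a single unit.

```python
from collections import Counter
import math

def score_bigram(bg, bigram_freq, unigram_freq, n_tokens):
    """Toy MWU-ness score: pointwise mutual information.
    (A stand-in for the talk's eight-dimensional scoring.)"""
    w1, w2 = bg
    p_bg = bigram_freq[bg] / (n_tokens - 1)
    p_w1 = unigram_freq[w1] / n_tokens
    p_w2 = unigram_freq[w2] / n_tokens
    return math.log2(p_bg / (p_w1 * p_w2))

def merge_pass(tokens, min_freq=2):
    """One iteration: score all contiguous bigrams, merge the winner.

    Returns the new token sequence and the merged bigram (or None
    if no bigram meets the frequency threshold)."""
    unigram_freq = Counter(tokens)
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    candidates = [bg for bg, f in bigram_freq.items() if f >= min_freq]
    if not candidates:
        return tokens, None
    best = max(candidates,
               key=lambda bg: score_bigram(bg, bigram_freq,
                                           unigram_freq, len(tokens)))
    # fuse every occurrence of the winning bigram into one unit,
    # so later iterations can build longer MWUs on top of it
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, best

tokens = "of course it is of course not of course".split()
tokens, best = merge_pass(tokens)
# best is ("of", "course"); repeated calls would keep merging
# the currently highest-scoring bigram until none qualifies
```

Because merged units re-enter the bigram pool, the procedure is sequential in the sense described above: longer candidate MWUs emerge from earlier merges.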

In the talk, I will outline the algorithm in detail: I will motivate its dimensions, illustrate how they are corpus-linguistically and quantitatively operationalized, provide a brief look at my implementation of it in R (using base R, data.table, and Rcpp; Chadi Ben Youssef, a Ph.D. student of mine, has just completed a faster Julia implementation, which will be made available soon), present its initial results (from an application to the Brown corpus), and, maybe most importantly, discuss where the algorithm still needs work.