**The Statistics of Non-Linguistic Symbol Systems**

Richard Sproat

Google, Tokyo

For 5000 years humans have been using visible marks to encode spoken language. For a far longer period, they have been using visible marks to encode concepts, ideas or, in general, a variety of non-linguistic information. When faced with an ancient symbol system whose meaning is unknown, can one tell whether it was linguistic (and therefore worth trying to decipher as a language), or some sort of non-linguistic system?

On the face of it, it seems reasonable to use as evidence statistical information on the behavior of symbols in the system. If the symbols distribute in a way that is similar to the distribution of elements (phonemes, morphemes, words, etc.) in language, then this could serve as evidence that the system is writing. In causal terms, the fact that it is writing causes the system to show the statistical properties it has.

Recent work that has used this line of argumentation suffers from a variety of problems. First, while such work invariably makes the claim that the statistical measures used are evidence for structure, often the measures actually tell us little or nothing about structure. Second, even if the measures do relate to structure, do they specifically imply *linguistic* structure? A parse tree looks very similar to a tree that describes the structure of a mathematical formula, so structure per se hardly seems enough. This leads to a third problem: such work depends to some degree on a widespread misconception that non-linguistic systems are structureless. Finally, there is the question of whether sample sizes for such systems are ever large enough to make robust statistical claims.

In this talk I review the results of my own work on the statistics of non-linguistic symbol systems, and draw a mostly negative conclusion about the possibility of finding statistical measures that are useful in deciding whether an unknown symbol system is linguistic.

If you would like to meet with the speaker, please contact Vera Demberg.