Why are Zipfian distributions found in language?

This post is a scholarly addendum to today's main post, aimed at satisfying the curiosity of my academic readers.  I'm going to leave you with an excerpt from an excellent book chapter, "Repetition and Reuse in Child Language Learning," by Colin Bannard and Elena Lieven.  The two take up the question of why Zipfian distributions are found in language.  A short suggested reading list with annotations follows.  Please feel free to leave links to other suggested reading in the comments.

Chapter Excerpt:

"Zipf's original claim was that [Zipf's Law arose as] the result of a compromise between the need of the speaker and the hearer to minimize their processing cost (what he called the principle of least effort).  The speaker wants to minimize the diversity of what is produced, but the hearer needs some diversity in order to disambiguate [the signal].  The recurrent transmission of language over this channel results in a situation where there is [both] enough repetition to make things easy for the speaker, and enough diversity to avoid ambiguity...

For those who do not like Zipf's cognitive model, a purely formal proposal was made by Simon (1955).  In this account, the likelihood of any word being repeated is exactly a function of how often it has been encountered before.  If language is generated according to this principle -- with a certain amount of probability space being held back for novel words to avoid stagnation -- then the resulting distribution is very close to what we observe in natural language.

A shared assumption of both these approaches is that the more a word or phrase has been heard before, the more it will be heard in the future. This kind of power law distribution is not peculiar to language and can, in fact, be observed in any number of social phenomena (e.g., links found on the internet, academic paper citations...).  The process that is thought to drive this is "preferential attachment" -- people prefer to link to websites and cite papers that have been linked to or cited before.  Similarly in language, they prefer to use words [and word sequences] that have been used by others before. ...Given the way in which language functions as a conventional communication system, this seems like a highly effective strategy by which to achieve successful communication.  The outcome of this process is... [that] speech will to some extent be repetitive and formulaic."

Suggested Reading List

Simon, H.A. (1955). On a class of skew distribution functions. Biometrika, 42, 425–440. DOI: 10.2307/2333389

Simon proposes that Zipfian distributions arise out of "preferential attachment," meaning that the more a word is heard, the more likely it is to be said in turn.  Cognitively, this suggests that speakers have both more practice with and better memory for words that they hear more often (and the rich, as they say, only get richer, leading to a heavily skewed distribution).
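
To make the mechanism concrete, here is a minimal simulation of a Simon-style process (my own sketch with an arbitrary novelty rate, not Simon's original formulation): with probability alpha, emit a brand-new word; otherwise, repeat a token drawn uniformly from everything said so far, so that a word's chance of being reused is proportional to its frequency to date.

```python
import random
from collections import Counter

def simon_process(n_tokens=100_000, alpha=0.1, seed=0):
    """Simon's rich-get-richer rule: with probability alpha, coin a
    brand-new word; otherwise repeat a token drawn uniformly from the
    history, i.e., with probability proportional to its count so far."""
    rng = random.Random(seed)
    history = []
    next_id = 0
    for _ in range(n_tokens):
        if not history or rng.random() < alpha:
            history.append(next_id)  # a novel word enters the vocabulary
            next_id += 1
        else:
            history.append(rng.choice(history))  # frequency-weighted reuse
    return history

freqs = sorted(Counter(simon_process()).values(), reverse=True)
# Zipf check: frequency should fall off roughly as a power of rank, so
# these (rank, frequency) pairs should be near-linear on log-log axes.
for rank in (1, 10, 100, 1000):
    print(rank, freqs[rank - 1])
```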

Miller, G.A. (1957). Some effects of intermittent silence. The American Journal of Psychology. DOI: 10.2307/1419346

Miller -- in contrast to Simon -- proposes that Zipfian distributions emerge from any randomly generated character sequence that includes a word delimiter.  The famous quote? "Research workers in statistical linguistics have sometimes expressed amazement that people can follow Zipf's Law so accurately without any deliberate effort to do so. We see, however, that it is not really very amazing, since monkeys typing at random manage to do it about as well as we do."
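
Miller's observation is easy to reproduce: let a "monkey" type characters at random, treat the space character as a word delimiter, and inspect the rank-frequency curve. A sketch (the alphabet size and space probability here are arbitrary choices of mine, not Miller's):

```python
import random
from collections import Counter

def monkey_text(n_chars=500_000, alphabet="abcdefghij", p_space=0.2, seed=0):
    """Type characters at random; a space ends the current word."""
    rng = random.Random(seed)
    chars = [" " if rng.random() < p_space else rng.choice(alphabet)
             for _ in range(n_chars)]
    return "".join(chars).split()

freqs = sorted(Counter(monkey_text()).values(), reverse=True)
# The rank-frequency curve falls off roughly as a power of rank --
# approximately Zipfian, despite zero linguistic structure.
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```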

Howes, D. (1968). Zipf's Law and Miller's Random-Monkey Model. The American Journal of Psychology, 81(2), 269–272. DOI: 10.2307/1421275

Howes points to a possible problem with Miller's proposal: namely, that the random character generator model would entail a direct correlation between word length and frequency.  However, no such correlation is observed in natural language.
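
To spell out why (back-of-the-envelope arithmetic from the monkey model's own assumptions, not from Howes's paper): if the space bar is hit with probability $p$ and each of the $k$ letters with probability $(1-p)/k$, then any particular word $w$ of length $L$ has probability

$$\Pr(w) = \left(\frac{1-p}{k}\right)^{L} p,$$

so a word's expected frequency is fixed entirely by its length, with all words of a given length predicted to be equally frequent.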

Ferrer-i-Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the United States of America, 100, 788–791. DOI: 10.1073/pnas.0335980100

Ferrer-i-Cancho and Solé offer a formal model showing that "Zipf's law is the outcome of the nontrivial arrangement of word–concept associations adopted for complying with hearer and speaker needs" and that "arranging signals according to Zipf's law is the optimal solution for maximizing the referential power under effort for the speaker constraints."  (They're math modelers, what do you expect...)
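
For readers who want the formula: as I read the paper, the quantity their model minimizes is a weighted trade-off between hearer effort and speaker effort,

$$\Omega(\lambda) = \lambda\, H(R \mid S) + (1 - \lambda)\, H(S),$$

where $H(S)$ is the entropy of the signals (the speaker's cost of maintaining a diverse vocabulary) and $H(R \mid S)$ is the hearer's residual uncertainty about the referent given a signal.  Zipf-like rank-frequency distributions emerge only near a critical value of $\lambda$, the point of balance between the two pressures.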

Biemann, C. (2007). A Random Text Model for the Generation of Statistical Language Invariants. Proceedings of HLT-NAACL-07, Rochester, NY, USA.

Biemann "proposes a plausible model for the emergence of large-scale characteristics of language without assuming a grammar or semantics."  "A key notion is the strategy of following beaten tracks: Letters, words and sequences of words that have been generated before are more likely to be generated again in the future..."

4 responses so far

  • AK says:

    Several thoughts:

    I wonder if we could find the same distribution across the specific meanings of a word (e.g., the use of "awesome" in its current devalued meaning).

    I suspect these mechanisms don't stop working when somebody becomes an adult: we should thus expect a dynamic process of language change within a speech community as words become newly popular.

    I also suspect that both children and (especially) adults adopt a word more quickly when higher-status individuals are using it than when lower-status ones are. It would be interesting to plot the dynamics of such changes in children in response to hearing a word in peer-group use vs. adult use (as well as at different statuses within the peer group).

    This also has implications for the evolution of the current language paradigm in isolated speech communities, especially pre-agricultural and early agricultural ones. The differences might have important implications WRT the way languages evolve(d) before and after the adoption of agriculture.

    We should also consider the role of poetic (epic) formulas in priming word frequency in pre-literate societies. Many of these formulas (in e.g. Homeric epic) are drawn from obsolescent or neighboring language paradigms to meet metrical and rhyme requirements. They would provide any specific speech community with a large fund of potential new "fad words" while individual communities could evolve semi-independently in response to the local dynamic.

  • Torbjörn Larsson, OM says:

    A word of caution: many reported power law distributions are in fact not power laws but exponential (or cannot yet be determined), according to papers by Cosma Shalizi et al. in which they test different fits. (Now that there is a method for testing, the situation can be expected to be rectified over time.)

    See for example here, or Shalizi's blog.

    Though I suspect that in the biological and social sciences power laws may be preferred, because they follow easily from simple predictive models like Simon's (say).

  • Torbjörn Larsson, OM says:

    My link didn't survive: http://arxiv.org/abs/0706.1062 .

  • William Idsardi says:

    Another word of caution, if I may. The other line of research missing from this summary is that of Benoit Mandelbrot (the source of Miller's approach). Mandelbrot has a lifetime of publications on power-law phenomena: see the Wikipedia entry for the Zipf-Mandelbrot law (http://en.wikipedia.org/wiki/Zipf–Mandelbrot_law), Mandelbrot's page at http://www.math.yale.edu/mandelbrot/, the extensive bibliography at http://www.nslij-genetics.org/wli/zipf/, and http://en.wikipedia.org/wiki/Pareto_distribution for a brief list of other phenomena that follow power-law distributions. Mandelbrot seeks a common explanation for power-law (heavy-tailed) distributions across these phenomena. If that goal is the right one, then it seems unlikely that there will be a common cognitive-psychology explanation, since the phenomena include physical ones.

    Mandelbrot is also famous for his dire (and correct) predictions about financial arbitrage. The technical papers are collected in his _Fractals and Scaling in Finance_ (1997); a very readable account of the financial crisis and the danger of incorrect estimation from heavy-tailed distributions is _The Quants_ (2010). The financial crisis can be taken as a sobering warning about using the wrong distributions for statistical estimation.

    For example, in psychological experiments reaction times are often treated as log-normally distributed (i.e., log(RT) ~ N(μ, σ)). In practice, however, RTs often remain heavy-tailed even when log-transformed (see http://cogprints.org/6603/1/recin-psychrev.pdf for discussion and comparisons of models). The problem is that power-law distributions have infinite variance when 0 < α < 2, so regular statistical methods that rely on finite variances will not work; and since they (obviously) underestimate the variance, they are anti-conservative with respect to the familiar Neyman-Pearson hypothesis tests.