This post is a scholarly addendum to today's main post, aimed at satisfying the curiosity of my academic readers. I'm going to leave you with an excerpt from an excellent book chapter, "Repetition and Reuse in Child Language Learning," by Colin Bannard and Elena Lieven. The two take up the question of why Zipfian distributions are found in language. A short suggested reading list with annotations follows. Please feel free to leave links to other suggested reading in the comments.
Chapter Excerpt:
"Zipf's original claim was that [Zipf's Law arose as] the result of a compromise between the need of the speaker and the hearer to minimize their processing cost (what he called the principle of least effort). The speaker wants to minimize the diversity of what is produced, but the hearer needs some diversity in order to disambiguate [the signal]. The recurrent transmission of language over this channel results in a situation where there is [both] enough repetition to make things easy for the speaker, and enough diversity to avoid ambiguity...
For those who do not like Zipf's cognitive model, a purely formal proposal was made by Simon (1955). In this account, the likelihood of any word being repeated is exactly a function of how often it has been encountered before. If language is generated according to this principle -- with a certain amount of probability space being held back for novel words to avoid stagnation -- then the resulting distribution is very close to what we observe in natural language.
A shared assumption of both these approaches is that the more a word or phrase has been heard before, the more it will be heard in the future. This kind of power law distribution is not peculiar to language and can, in fact, be observed in any number of social phenomena (e.g., links found on the internet, academic paper citations...). The process that is thought to drive this is "preferential attachment" -- people prefer to link to websites and cite papers that have been linked to or cited before. Similarly in language, they prefer to use words [and word sequences] that have been used by others before. ...Given the way in which language functions as a conventional communication system, this seems like a highly effective strategy by which to achieve successful communication. The outcome of this process is... [that] speech will to some extent be repetitive and formulaic."
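The Simon-style process the excerpt describes is easy to simulate. The sketch below is a toy illustration, not code from any of the papers: the novelty probability `alpha` and the corpus size are arbitrary choices I've made for the demo. With probability `alpha` the "speaker" coins a new word; otherwise a word is repeated in proportion to how often it has already been used.

```python
import random
from collections import Counter

def simon_text(n_tokens, alpha=0.05, seed=42):
    """Generate tokens by Simon's rich-get-richer process: with
    probability alpha emit a brand-new word; otherwise repeat a token
    drawn uniformly from everything said so far, so a word's chance of
    recurring is proportional to its past frequency."""
    rng = random.Random(seed)
    tokens = []
    next_new = 0
    for _ in range(n_tokens):
        if not tokens or rng.random() < alpha:
            tokens.append(f"w{next_new}")  # hold back probability space for novel words
            next_new += 1
        else:
            tokens.append(rng.choice(tokens))  # preferential attachment
    return tokens

tokens = simon_text(50_000)
freqs = sorted(Counter(tokens).values(), reverse=True)
# A roughly Zipfian signature: frequency falls off steeply with rank.
for rank in (1, 10, 100):
    print(rank, freqs[rank - 1])
```

Plotting rank against frequency on log-log axes makes the resemblance to natural-language distributions easier to see, but even the printed counts show the steep drop-off.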
Suggested Reading List
Simon, H.A. (1955). On a class of skew distribution functions. Biometrika, 42, 425-440. DOI: 10.2307/2333389
Simon proposes that Zipfian distributions arise out of "preferential attachment," meaning that the more a word is heard, the more likely it is to be said in turn. Cognitively, this suggests that speakers have both more practice with and better memory for words that they hear more often (and the rich, as they say, only get richer, leading to a heavily skewed distribution).
Miller, G.A. (1957). Some effects of intermittent silence. The American Journal of Psychology. DOI: 10.2307/1419346
Miller -- in contrast to Simon -- proposes that Zipfian distributions emerge from any randomly generated character sequence that includes a word delimiter. The famous quote? "Research workers in statistical linguistics have sometimes expressed amazement that people can follow Zipf's Law so accurately without any deliberate effort to do so. We see, however, that it is not really very amazing, since monkeys typing at random manage to do it about as well as we do."
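Miller's monkey-typing argument can be reproduced in a few lines. This is a toy version under assumptions of my own: the space probability (0.2) and text length are arbitrary, chosen only so that words are short enough to repeat within the sample.

```python
import random
from collections import Counter

rng = random.Random(0)

def monkey_text(n_chars, p_space=0.2):
    """Type n_chars at random: a space with probability p_space,
    otherwise a uniformly chosen letter. Spaces delimit 'words'."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(" " if rng.random() < p_space else rng.choice(letters)
                   for _ in range(n_chars))

words = monkey_text(300_000).split()
freqs = sorted(Counter(words).values(), reverse=True)
# The random "words" show the same steep rank-frequency drop-off
# that Zipf observed in real text.
for rank in (1, 5, 25, 125):
    print(rank, freqs[rank - 1])
```

The resulting curve is actually stepwise (all words of a given length share an expected frequency), which is exactly the weak point Howes seizes on below.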
Howes, D. (1968). Zipf's Law and Miller's Random-Monkey Model. The American Journal of Psychology, 81(2), 269-272. DOI: 10.2307/1421275
Howes points to a possible problem with Miller's proposal: namely, that in the random character generator model a word's frequency is fixed entirely by its length -- shorter words must be more frequent, and all words of a given length equally frequent. No such tight relation between length and frequency is observed in natural language.
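Howes's objection is also easy to see in simulation. Reusing the same random-typing setup as above (again with an arbitrary space probability of 0.2), the average frequency of monkey "words" drops by a roughly constant factor with each added letter -- a lockstep relation that real vocabularies do not obey.

```python
import random
from collections import Counter, defaultdict

rng = random.Random(1)
letters = "abcdefghijklmnopqrstuvwxyz"
# Random typing: space with probability 0.2, otherwise a random letter.
text = "".join(" " if rng.random() < 0.2 else rng.choice(letters)
               for _ in range(300_000))
counts = Counter(text.split())

# Group word frequencies by word length; in the monkey model,
# length alone determines expected frequency.
by_length = defaultdict(list)
for word, c in counts.items():
    by_length[len(word)].append(c)
for length in sorted(by_length)[:3]:
    avg = sum(by_length[length]) / len(by_length[length])
    print(length, round(avg, 1))
```

In real corpora, by contrast, words of the same length span several orders of magnitude in frequency ("the" vs. "tho"), which is Howes's point.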
Ferrer-i-Cancho, R., & Solé, R. V. (2003). Least effort and the origins of scaling in human language. Proceedings of the National Academy of Sciences of the United States of America, 100, 788-791. DOI: 10.1073/pnas.0335980100
Ferrer-i-Cancho and Solé offer a formal model showing that "Zipf's law is the outcome of the nontrivial arrangement of word–concept associations adopted for complying with hearer and speaker needs" and that "arranging signals according to Zipf's law is the optimal solution for maximizing the referential power under effort for the speaker constraints." (They're math modelers, what do you expect...)
Biemann, C. (2007). A Random Text Model for the Generation of Statistical Language Invariants. Proceedings of HLT-NAACL-07, Rochester, NY, USA.
Biemann "proposes a plausible model for the emergence of large-scale characteristics of language without assuming a grammar or semantics." As he puts it, "A key notion is the strategy of following beaten tracks: Letters, words and sequences of words that have been generated before are more likely to be generated again in the future..."