The Long Tail of Language

Aug 31 2010 Published under Forget What You've Read!, From the Melodye Files

“The truth is rarely pure and never simple.  Modern life would be very tedious if it were either, and modern literature a complete impossibility!”
–Oscar Wilde, The Importance of Being Earnest

My apologies to readers who may be wondering when the promised series would materialize.  The weekend was spent taking snaps of Laura La Rue and drinking double-digit vino in the kitchen with Professor Plum and Miss Scarlet.  If this science-thing doesn’t work out, I’m off to join the jet-set.  I’m still wondering if it’s possible to style myself after a desperately bespectacled Grace Kelly?

(More on that later…)

In any case : in today’s posting, I take up a rather curious property of human languages that you may have never properly been introduced to.  And that property is Zipfian.

George Kingsley Zipf – if you haven’t read the man, personally –  was a Harvard linguist and philologist who studied the various statistical properties of human languages.  He is most famous for discovering the eponymous ‘Zipf’s Law’ : the finding that the frequency distribution of words in a given language follows an inverse power curve.

But – you say – hold the phone!  What in Zipf’s name does that mean?

Quite what you’d expect, madame; presuming you’re up on your power laws…  When we speak, we use some words – like ‘the’ and ‘of’ – all the time (they’re hugely popular, really).  But the popularity curve tails off fairly rapidly after that – ‘man’ is quite a bit less popular, and ‘book’ less popular still.  And ‘brain’!  Why, we hardly use the word at all, in the great scheme of things…  And just think of ‘phantasmagory’ or ‘bastinado,’ or ‘boorishness’ or ‘bombast’!

This means that should we inspect a given sample of language – this post, for example – we will find an abundance of certain highly-frequent word types, but little to no evidence about most other possible words, which make up the silent majority.  For instance, in this post thus far, the word ‘the’ has already occurred 25 times, while the word ‘nightshade’ has occurred – well, only the once, just then.
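To make the rank-and-frequency idea concrete, here is a minimal sketch in Python. It assumes a crude tokenizer (lowercased runs of letters and apostrophes) and an invented toy sample, and the function name `rank_frequency` is hypothetical. On a large corpus, Zipf's law predicts that a word's frequency falls off roughly as 1/rank; a sample this small can only hint at the shape.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Toy sample, invented for illustration only.
sample = (
    "the cat sat on the mat and the dog sat by the door "
    "of the house while the nightshade grew"
)

table = rank_frequency(sample)
for rank, word, count in table[:3]:
    # Under Zipf's law, rank * count stays roughly constant on large corpora.
    print(rank, word, count, rank * count)

# Words seen exactly once: the start of the long tail.
once = [word for _, word, count in table if count == 1]
print(len(once), "of", len(table), "word types occur only once")
```

In this toy sample, 'the' accounts for 6 of the 20 tokens, while 12 of the 14 word types occur exactly once.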

This state of affairs may appear somewhat strange at first glance; clearly, our conversations aren’t simply running streams of “the an on over of.”  However, this raises an obvious point: that many of the words we use quite commonly are function words, which serve as bits of linguistic scaffolding. Meanwhile, it is the long tail of ‘everything else’ – the rarer and more variegated words – which makes up the flesh and blood of our language, and the content of our communication.

“In the Oxford English Dictionary, there are nearly half a million lexical items.  With no interruptions it would take about three solid days of filibustering to pronounce so many words.  If the supply of different words were the only controlling factor, we could carry out our daily quota of verbalizations for weeks before we had to use the same word a second time.  The statistical fact is, however, that we manage to get along for only 10 or 15 words, on the average before we repeat ourselves.  In writing, our favorite word is ‘the.’  On the telephone, our favorite word is ‘I.’  The 50 most commonly used word types make up about 60% of the word tokens we say…”
–George Miller, Language and Communication
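Miller's figure is a statement about type/token coverage: what share of all the words we utter (tokens) is taken up by the few most frequent distinct words (types). A minimal sketch of that calculation follows, assuming whitespace tokenization and an invented toy token list; the function name `top_k_coverage` is hypothetical, and checking the 60% figure itself would need a large corpus of real speech.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


# Toy token list, invented for illustration only.
tokens = "the cat sat on the mat and the dog sat by the door".split()

for k in (1, 2, 5):
    print(k, round(top_k_coverage(tokens, k), 2))
```

With a real corpus, one would call `top_k_coverage(tokens, 50)` and compare the result against Miller's "about 60%".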

Of course, Zipf’s law holds for more than just individual words.  It also appears to hold for longer strings of words (such as bigrams) and even for word meanings.  For example, take a semantically promiscuous word like ‘make,’ which has several hundred definitions in the Oxford English Dictionary.  If we look at a large record of human speech, what we find is that usage of ‘make’ follows a Zipfian curve, with a relative handful of meanings making up the majority of uses (Lorge, 1937; Thorndike & Lorge, 1944).  (Intriguingly, it is precisely the words that are most commonly used in our language that have the most meanings!)

That human language appears to be governed by implicit statistical regularities in this way raises a deeply fascinating and highly contentious question -- indeed, a question that has been subject to much empirical and theoretical scrutiny over the past fifty years.  The question is this : is language predictive and probabilistic in structure, such that it is governed by statistical regularities that we learn over time and exposure?  Or is language based upon an innate system of recursive rules, which govern basic syntax, and from which words and sentences are constructed?

To translate, for the non-expert : when we say or write something (e.g., "She told me he had painted the landlady's walls red this morning") is our speech organized according to a structured, rule-based grammar that carefully dictates each word's rightful place? (see Chomsky, 1957; 1963)  Or is our production governed by learned probabilities that help us navigate speech -- this word usually goes with this, and this that, and this kind of word is often used like this word, except in these ways...

This question has stirred up quite a bit of debate, as you might imagine, not least because it raises the question of how much of language comes prepackaged.  A rule-based account suggests that we must come to the task of learning language with a highly complex, innate architecture built into our brains specifically for the task, which must somehow account for how languages of every different sort and stock can be learned according to one universal 'hardware' package.  A probabilistic account suggests that we come to learning language with a suite of general learning mechanisms that have been co-opted for the task, and which allow us to effectively mine the statistical structure of the language we are exposed to, such that we learn -- over time -- how to make flexible use of words (and sequences of words) to speak creatively and productively.

This debate has raised quite a number of puzzles and problems over the decades, the keenest of which is the question of how -- if language is predictive and probabilistic -- we can possibly come to learn it.  This brings us up against the issue of 'data sparsity,' which I will touch on more thoroughly next time.

The short of it is : Given that we can produce long and highly complex sentences that we have never heard before (and indeed, that may have never been spoken before), how can we accurately estimate the likelihood of such sentences?  How can our 'probabilistic' experience with language inform whether or not these sentences are informative, meaningful or 'correct'?  In other words, how can we use statistics -- rather than rules -- to produce (and comprehend) something as seemingly structured as language?
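One way to see both the problem and a possible way out is a toy bigram model. The sketch below is a textbook illustration (bigram counts with add-one, or Laplace, smoothing), not the account this series will go on to argue for; the corpus is invented and the function names are hypothetical. It scores a sentence by multiplying the estimated probability of each word given the word before it, so a sentence that was never heard can still receive a sensible, non-zero score if its pieces are familiar.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Add-one (Laplace) smoothed bigram probability of a sentence."""
    # Simplification: the vocabulary size includes the boundary markers.
    vocab_size = len(unigrams)
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        # Unseen bigrams count as 0, so smoothing keeps the product above zero.
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Toy corpus, invented for illustration only.
corpus = [
    "she painted the walls red",
    "he painted the door",
    "she told me a story",
]
unigrams, bigrams = train_bigram_model(corpus)

novel = "he painted the walls red"       # never in the corpus, but well-formed
scrambled = "red the painted walls he"   # same words, unfamiliar order

p_novel = sentence_probability(novel, unigrams, bigrams)
p_scrambled = sentence_probability(scrambled, unigrams, bigrams)
print(p_novel > p_scrambled)
```

The novel sentence scores higher than the scrambled one because every one of its word pairs has been seen before, even though the sentence as a whole has not. Real language is far sparser than this toy corpus, which is exactly where the difficulty lies.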

This question is pressing because it isn't simply a logical or computational one.  If we can't nail down (or even approximate) the algorithms that would allow a human to do so, or if we think that the computational problem posed is far too hard, then perhaps there is little reason to believe that people could -- even in principle -- be learning or using language statistically.  Instead, we might suggest that they were learning it in a rule-based fashion, which effectively allowed them to shortcut the otherwise-insurmountable learning process.

However, which approach we determine to be best is a question to be resolved computationally.  In the series to follow, I will illustrate why a predictive, probabilistic account of language is a good fit -- both empirically and theoretically -- with what we understand about language acquisition and use; and show how (and why) many of the 'logical' problems leveled against such an account turn out not to be so logical after all.

P.S.  To the unconvinced / already-decided-s : there is a Charles Yang post just waiting in the wings!  I caution patience.

P.P.S. If you want to read about Zipf in a non-linguistic setting, Chris Anderson did a fantastic article for Wired in 2004 about "The Long Tail" of the demand curve in the Internet age.

5 responses so far

  • I want to ask if it's necessarily either-or; whether the brain might mostly process language in the probabilistic way but supplement this with some architectural heuristics. Unfortunately you've just started the series and I don't know the subject deeply enough to say whether or not my question is even meaningful.

    • melodye says:

      To the contrary -- it's definitely a meaningful question. The short answer is : yes.

      Here's a helpful quote from Scholz & Pullum (2006) : "But from the claim that language acquisition must be affected by some sorts of bias or constraint it does not follow that those biases or constraints must stem from either linguistic universals or parameters. A non-nativist can readily accept biases or constraints stemming from sensory mechanisms that are specific to language but non-cognitive, or cognitive-computational mechanisms that are not language-specialized."

      In other words, it is perfectly rational for an empiricist -- who does not buy into the idea of a hardwired grammar -- to believe that the cognitive architecture of the human brain is in some way built (or biased) toward the acquisition of language. For example, I've reported on work by Thompson-Schill et al (2009) and Ramscar & Gitcho (2007) which suggests that delayed prefrontal cortical development may be part of the key to unlocking why humans -- and not other species -- develop language.

  • physioprof says:

    Very interesting! I love how power law distributions pop up everywhere in the complexity of biological reality. A really entertaining and well-written book that addresses this in the context of economics is "The Black Swan" by Nassim Nicholas Taleb.

  • Dexter Edge says:

    Personally, I say "bastinado" several times a week (although my friends wish I wouldn't).