This morning, I scrawled a letter to a friend that began with the following:
Spent the weekend at the FyeahFest with a starry-eyed lot of starving hipsters, in vintage hops and wingtips. Had not realized how obvious the effects of doing molly are on the pupils… the droves wandering past had eyes like shining saucers.
Unlike my trusted ami de plume, you may not know what on the lord's green earth I just said. In particular, if you’re not into psychedelics or don’t know anyone who is, you may be wondering just what ‘molly’ is, anyway. The extraordinary thing is that -- odds are -- even if you’ve never heard the word used before, you can probably hazard a pretty good guess as to what it means.
Take a moment. What’s your bet?
The context gives you some immediate clues. First, the word ‘molly’ is prefaced by the phrase “the effects of doing…” If you were to Google that phrase, you would find that some of the most frequent words to follow that sequence are kinds of drugs: Heroin. Ecstasy. Shrooms. Peyote. All of these words appear on the first page of search results.
If you know anything about the serotonergic effects of certain drugs, or the predilections of LA hipsters, you can then safely narrow it down to MDMA or psilocybin. And even if you don’t get that far in your guesswork, you should be able to safely assume that molly is, after all, a drug – and one that gets popped at indie concerts, by skinny teenagers in tight pants, with ironic mustaches and ugly hats. Yes, it’s that kind of drug.
So you haven't quite figured out what molly is precisely, but you've managed to dramatically narrow the search space.
Alright, alright, you say -- this isn't so earth-shattering. This is just logical deduction -- right?
But there is something quietly astounding about that process, at least from the vantage point of how we learn and understand language.
Here’s one thing to consider: how did you know, in context, that the phrase “the effects of doing…” should be followed by a drug?
It turns out that in English -- and in every other language studied -- words do not bunch together with equal frequency. For example, there are many possible words that could potentially follow the phrase “the effects of doing…” but which rarely or never do so. Compare the number of Google hits for “the effects of doing drugs” (13,900) with:
…the effects of doing research (6)
…the effects of doing everything (4)
…the effects of doing better (2)
…the effects of doing laundry (1)
…the effects of doing yardwork (0)
This is a strange state of affairs. For one thing, the word ‘research’ is actually 22 times more frequent a word than is ‘drugs,’ overall. For another thing, if you dig into the phrase, you find that the trigram ‘of doing research’ is still 10 times as frequent as ‘of doing drugs.’ But in the grand scheme of things, none of this matters – because when we talk about ‘the effects of doing’ something, we’re not usually talking about research, yardwork or laundry. What we’re usually talking about are things like business, exercise or drugs. And in this case – quite obviously – drugs. (Unless you’d like to assume that it’s yoga what’s getting all them hipsters starry-eyed).
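You can get a feel for these continuation counts with a few lines of code. The sketch below (a toy, with an invented three-sentence 'corpus' standing in for Google's index) just tallies which words follow a given context phrase:

```python
from collections import Counter

def continuation_counts(corpus, context):
    """Count which words immediately follow a context phrase in a corpus."""
    counts = Counter()
    ctx = context.lower().split()
    n = len(ctx)
    for sentence in corpus:
        words = sentence.lower().split()
        for i in range(len(words) - n):
            if words[i:i + n] == ctx:
                counts[words[i + n]] += 1
    return counts

# Invented mini-corpus for illustration:
corpus = [
    "the effects of doing drugs are obvious",
    "she studied the effects of doing drugs on memory",
    "the effects of doing research were modest",
]
print(continuation_counts(corpus, "the effects of doing"))
# Counter({'drugs': 2, 'research': 1})
```

Run the same tally over a web-scale corpus and you get exactly the lopsided counts above: 'drugs' towers over 'laundry' and 'yardwork' in this slot.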
What’s interesting about all this, of course, is that we can take a corpus – a large record of written or spoken English, like Google, for example – and show that there are these massive frequency differences in terms of how we talk about things. We “juggle balls” more than we “juggle spoons,” and “run away” more than we “run defense.” Words distribute in Zipfian-like patterns, with some words (and longer sequences of words) clustering together very, very commonly, while others hardly ever appear together at all – even though there is no ‘grammatical’ or even logical prohibition against such combinations.
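If you want to see that Zipfian-like shape for yourself, just rank the words of any text by frequency. Here's a minimal sketch over a made-up snippet; on real text of any size, frequency falls off roughly as 1/rank:

```python
from collections import Counter

# A made-up snippet; any sizeable real text shows the same skew, only sharper.
text = ("the cat sat on the mat and the dog sat by the door "
        "while the cat watched the dog and the mat").split()

freqs = Counter(text)
ranked = freqs.most_common()  # sorted from most to least frequent
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)  # 'the' towers over everything else
```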
What's more, there's strong evidence that we make constant and considerable use of this kind of frequency information in everyday speech -- almost as if we've internalized something very much akin to Google.
When we are engaged in listening to a conversation, for instance, we carry with us expectations about what will be said next based both on the topic under discussion, and on the particular stream of words we're hearing at that instant. These expectations are wholly probabilistic in nature. To give two contrasting examples -- if we heard someone begin to say "why did the chicken cross..." we would be anticipating "...the road" with almost 100% certainty. But if, on the other hand, we heard the phrase "the effects of doing..." we would be holding "drugs," "business in China," "exercise," "nothing," and a bunch of other possibilities in reserve (and waiting on the actual punchline). Those expectations would be weighted according to the degree that those words and phrases actually occur following that distributional context and to the extent that they were predicted by the prior discourse.
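Those graded expectations are, at heart, conditional probabilities, and a crude version can be read straight off bigram counts. A sketch, with a hypothetical five-sentence corpus standing in for a lifetime of language exposure:

```python
from collections import Counter, defaultdict

def next_word_distribution(corpus, context_word):
    """Estimate P(next word | previous word) from raw bigram counts."""
    bigrams = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            bigrams[prev][nxt] += 1
    counts = bigrams[context_word.lower()]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical mini-corpus, skewed the way real usage is:
corpus = [
    "the effects of doing drugs",
    "the effects of doing drugs",
    "the effects of doing drugs",
    "the effects of doing business",
    "the effects of doing exercise",
]
dist = next_word_distribution(corpus, "doing")
print(dist)  # {'drugs': 0.6, 'business': 0.2, 'exercise': 0.2}
```

A real listener also folds in the larger discourse context, but even this bare-bones estimate captures the graded weighting: "drugs" is anticipated most strongly, with "business" and "exercise" held in reserve.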
In a sense then, the process of 'narrowing the search space' arises out of how we use what we know about how words distribute in language to predict what's going on in any given conversation. We start off with a certain set of expectations about what will be said, which is then violated when we encounter a new and unfamiliar word (like molly). Amidst the surprise, learning occurs: the expectations that we had for what would fill that space then inform what we take that mystery word to mean. Of course, in this, we generalize. We guesstimate. We don't come at it with absolute precision.
Molly, for example, doesn't simply mean 'drugs' and it doesn't simply mean 'e' (if that's what you landed on). Molly is a slang term for 'molecular' and it refers to pure MDMA capsules that haven't been cut with any other drug -- like caffeine, speed, or coke. According to some, molly is the 'holy grail' of ecstasy.
So -- you didn't nail it this time. But if I hadn't told you outright, you'd probably have learned it the next time you heard it, or the time after. Maybe you would have used it to mean 'drugs [at large],' and some knowing space cowboy would have checked your tongue. Over time, your expectations about that search space would become more and more refined, until you had discriminated just how it was used (with these words, and these phrases, and not others; in these contexts, and conversations, over others).
There's much more to say: about semantic-space models; about similarity-based generalization and contextual discrimination; and about the many good reasons to believe that word learning is actually a process driven largely by discrimination (and not by generalization, as many researchers seem to assume). Perhaps, in there too, there's an anecdote about the hilarity that ensues when high schoolers prepped for the SAT learn words out of context. For now, I'm a slave to sleep. But it's comin'...
 Indeed, some phrases – like ‘years ago’ and ‘at the same time’ – are actually more frequent than fairly common words, like ‘bread’ and ‘doctor.’ This takes us back to the question, discussed in an earlier post, of whether words should be rightly considered as individual units of meaning (as set apart from larger chunks of language).
While there has been some sustained debate over why this occurs, many scientists now think that language is Zipfian because of a cognitive phenomenon known as "preferential attachment." The idea is simply that the more often you hear a word, the more practice you'll have with it, the better you'll remember it, and thus the more likely you'll be to produce it in turn. This leads to a "rich get richer" cycle, in which certain 'preferred' words become more and more frequent, while the rest lope off into low-frequency oblivion.
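Preferential attachment is easy to simulate: with some small probability, coin a brand-new word; otherwise, repeat a token drawn from everything said so far, so that a word's chance of being reused is proportional to how often it has already been used. This toy Simon-style process (parameters invented for illustration) reliably yields a few very frequent words and a long tail of rare ones:

```python
import random
from collections import Counter

def simon_process(n_tokens, alpha=0.1, seed=42):
    """Generate n_tokens words. With probability alpha, invent a new word;
    otherwise reuse a past token chosen uniformly from the history
    (i.e., with probability proportional to its frequency so far)."""
    rng = random.Random(seed)
    history = ["w0"]
    new_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < alpha:
            history.append("w%d" % new_id)
            new_id += 1
        else:
            history.append(rng.choice(history))
    return Counter(history)

freqs = simon_process(10_000)
counts = [c for _, c in freqs.most_common()]
print(counts[:5], "...", counts[-5:])  # a few giants, a long tail of rarities
```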
Intriguingly, you can see this happening on a small scale when you look at a text. Because of the long tail of language, most words will not occur in a given document. However, if a relatively rare word should occur, it is much more likely to be seen again in that same text. This phenomenon is known as "burstiness," because it captures how words occur in 'bursts.' Church & Gale (1995) give the following (somewhat amusing) example: "The Poisson distribution predicts that lightning is unlikely to strike twice in a single document. We shouldn't expect to see two or more instances of boycott in the same document (unless there is some sort of hidden dependency that goes beyond the Poisson). But when it rains, it pours. If a document is about boycotts, we shouldn't be surprised to find two boycotts or even a half dozen in a single document."
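The Poisson baseline in that quote is easy to make concrete. Suppose (hypothetically) that 'boycott' occurs 6 times across 100 documents. A Poisson model says that, given the word shows up in a document at all, a repeat should be rare; a bursty corpus, where all six tokens pile into the one document that's about boycotts, blows right past that prediction:

```python
from math import exp

def poisson_repeat_given_present(lam):
    """Under X ~ Poisson(lam): P(X >= 2 | X >= 1)."""
    p0 = exp(-lam)          # P(X = 0)
    p1 = lam * exp(-lam)    # P(X = 1)
    return (1 - p0 - p1) / (1 - p0)

lam = 6 / 100  # 6 tokens spread evenly over 100 documents
print(round(poisson_repeat_given_present(lam), 3))  # ~0.03: repeats should be rare

# But in bursty text, the 6 tokens may all land in a single document:
doc_counts = [6] + [0] * 99
present = [c for c in doc_counts if c >= 1]
repeat_rate = sum(c >= 2 for c in present) / len(present)
print(repeat_rate)  # 1.0: when it rains, it pours
```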
[For more on the debate about why language is Zipfian, I've added a supplementary post for scholars].
 There are dozens upon dozens of papers to cite in this literature, in speech-processing, the visual-world paradigm and in information theory -- I will save the raft of citations for a later annotated post.
 There is evidence from ERP that given a particular distributional context, readers anticipate upcoming words in a graded fashion that is strongly correlated with their actual likelihood of occurrence within that context. See for example, DeLong, Urbach & Kutas, 2005.
Church, K. W., & Gale, W. A. (1995). Inverse document frequency (IDF): A measure of deviations from Poisson. In NLP Using Very Large Corpora. Kluwer Academic Publishers.
MacDonald, S., & Ramscar, M. (2001). Testing the Distributional Hypothesis: The Influence of Context on Judgements of Semantic Similarity. Proceedings of the 23rd Annual Conference of the Cognitive Science Society, University of Edinburgh.
DeLong, K., Urbach, T., & Kutas, M. (2005). Probabilistic word pre-activation during language comprehension inferred from electrical brain activity. Nature Neuroscience, 8(8), 1117-1121. DOI: 10.1038/nn1504
"Just because I'm dressed this way, does not make me a police officer."