Another Post from Bob Berwick. We are working on getting him to be able to post on his own. Stay tuned but until then enjoy. (NH)
Some of the most memorable novels spring to mind with a single sentence: “Every happy family is alike; every unhappy family is unhappy in its own way”; “Longtemps, je me suis couché de bonne heure” (“For a long time, I went to bed early”). When it comes to linguistics, most would agree that top honors for most memorable go to Chomsky’s colorless green ideas sleep furiously. Most memorable, yes, but perhaps also most misunderstood. How so? Let me explain. Colorless green ideas was meant to draw a distinction between grammatical nonsense and ungrammatical nonsense, the same words reversed: furiously sleep ideas green colorless. Now, you may have read colorless green ideas so often when leafing through Syntactic Structures that your mind simply skips right past these examples on page 16, (1) for colorless and (2) for its reversal. Or perhaps you’ve read colorless green ideas so often that it’s started making sense to you. Or perhaps you haven’t read SS at all, but merely heard it ‘bruited in the byways’ that modern-day natural language statisticians have pooh-poohed the contrast, figuring out that example (1) is roughly 200,000 times more likely than (2) – a completely satisfying New Age account as to why colorless green ideas seems more OK than furiously sleep ideas. If you’re this last sort of reader, or even the first, you might safely conclude that you needn’t pay attention to such examples anymore. But you would be wrong. Ironically, nearly 60 years ago, Chomsky set out just about the same explanation for the contrast between (1) and (2) as the one now in vogue among some statistical folks. What’s more, his is actually better – the ancient explanation provides empirical evidence backing this distinction, including why colorless green ideas is in fact so memorable, evidence the more recent account somehow left behind. In short, New Age, meet Old Age: congratulations, you’ve just re-discovered what was already known a couple of generations ago.
The problem, it appears, is that same plague alluded to in my last post: when it comes to digging out the past, in computational linguistics you can check your library card at the door. While lousy scholarship has long been an endemic disease amongst AI people (Roger Schank once boasted that he never read anything because it “destroyed his imagination to think of new ideas” – but don’t get me started, or I’ll explain how another current AI paramour, ‘Bayes nets’, was invented by a second-year Harvard grad student in 1918, not, as is commonly believed, in the 1980s), the pestilence has spread far and wide (cf. the Internet Veil of Ignorance). So, Sherman, let’s set the Wayback Machine to, oh, the years 1955–1956. And the place? Somewhere between Philadelphia and Cambridge, Massachusetts.
We’ve arrived. Blowing the dust off a long typed manuscript that has somehow vanished down the Orwellian memory hole, labeled “The Logical Structure of Linguistic Theory,” aka LSLT (Order 91920, filmed by Harvard College Library), we turn to Chapter IV, pages IV-145 through IV-147, examples (17′) and (18′) – our examples colorless green… and furiously sleep…. It’s these pages that Chomsky drew on for the class notes that became SS. And, crucially, the original contains a missing puzzle piece that didn’t find its way into SS – a revolutionary example, as it turns out.
On page IV-146, Chomsky observes that it is a matter of empirical fact that English speakers readily distinguish (1) from (2): “Yet any speaker of English will recognize at once that (1) is an absurd English sentence while (2) is no English sentence at all, and he will consequently give the normal intonation pattern of an English sentence to (1), but not to (2)” [examples renumbered to match SS]. This bit of empirical evidence is duly noted in SS: “a speaker of English will read (1) with a normal sentence intonation, but he will read (2) with a falling intonation on each word: in fact, with just the intonation pattern given to any sequence of unrelated words. He treats each word as a separate phrase” (1957, 16). In other words, English speakers parse (1) into its proper constituent phrases, with colorless green ideas as the Subject, sleep furiously as the Verb Phrase, and so forth. Further empirical confirmation: it’s easier to recall (1) than (2) – which is why colorless green takes honors as memorable; nobody remembers furiously sleep ideas green colorless.
But since in fact no English speaker has actually encountered either sentence (1) or (2) before, how do they know the two are different, and so give (1) normal intonation, assigning it normal syntactic structure, while pronouncing (2) as if it had no syntactic structure at all? Clearly, English speakers must be using information other than a literal count of occurrences in order to infer that colorless green… is OK, but not furiously sleep… Like what? Chomsky offers the following obvious solution – the puzzle piece that’s not in SS: “This distinction can be made by demonstrating that (1) is an instance of the sentence form Adjective-Adjective-Noun-Verb-Adverb, which is grammatical by virtue of such sentences as revolutionary new ideas appear infrequently that might well occur in normal English” (1955, IV-146; 1975:146). So let’s get this straight: when the observed frequency of a particular word string is zero, Chomsky proposes that people side-step the problem by using aggregated word classes rather than literal word frequencies, so that colorless falls together with revolutionary, green with new, and so forth. People then assign an aggregated, word-class-based phrase structure to (1), so the effective probability of colorless green ideas is no longer zero, but something parasitic on revolutionary new ideas. (In the Appendix to Chapter IV, 1955, Chomsky even offers an information-theoretic clustering algorithm to automatically construct such categories, with a worked example – work done jointly with Peter Elias. But we won’t go there today.)
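Chomsky’s move can be sketched in a few lines of code. The tiny lexicon and the single attested sentence form below are purely illustrative (nothing here comes from LSLT itself beyond the two example sentences): a never-before-seen string counts as grammatical if its word-class sequence matches a form licensed by ordinary sentences like revolutionary new ideas appear infrequently.

```python
# A minimal sketch of the LSLT word-class idea: a zero-frequency string
# is judged grammatical if its part-of-speech sequence matches a
# sentence form attested by ordinary English sentences.
# The lexicon and form inventory below are toy illustrations.

POS = {
    "colorless": "Adj", "green": "Adj", "revolutionary": "Adj", "new": "Adj",
    "ideas": "Noun", "sleep": "Verb", "appear": "Verb",
    "furiously": "Adv", "infrequently": "Adv",
}

# Forms made grammatical "by virtue of such sentences as
# revolutionary new ideas appear infrequently":
ATTESTED_FORMS = {("Adj", "Adj", "Noun", "Verb", "Adv")}

def grammatical(sentence: str) -> bool:
    """True iff the sentence's word-class sequence matches an attested form."""
    tags = tuple(POS[w] for w in sentence.split())
    return tags in ATTESTED_FORMS

print(grammatical("colorless green ideas sleep furiously"))   # True
print(grammatical("furiously sleep ideas green colorless"))   # False
```

On this picture (1) inherits its well-formedness from its sentence form, while the reversal (2) matches no form at all – which is exactly the categorical contrast in intonation and recall that the quoted passages describe.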
Turning now to one modern statistical take on the same problem, what do we discover? The same solution: aggregate words into classes, and then use class-based frequencies to replace zero-count word sequences. Here’s the relevant New Age excerpt: “we may approximate the probability p(x, y) of occurrence of two words x and y in a given configuration as p(x) ∑c p(y|c) p(c|x)”… “In particular, when (x, y) [are two words] we have an aggregate bigram model (Saul & Pereira, 1997), which is useful for modeling word sequences that include unseen bigrams” (Pereira, 2000:7). Roughly then, instead of estimating the probability that word y follows word x based on actual word counts, we use the likelihood that word x belongs to some word class c, and then the likelihood that word y follows word class c. So for instance, if colorless green never occurs, we instead note that colorless is in the same word class as revolutionary – i.e., an Adjective – and calculate the likelihood that green follows an Adjective. In turn, if we have a zero count for the pair green ideas, then we replace that with an estimate of the likelihood of Adjective-ideas… and so on down the line. And where do these word classes come from? As Saul & Pereira (SP) note, when trained on newspaper text, these aggregate classes often correspond to meaningful word classes. For example, in SP’s Table 3, p. 84, with 32 classes, class 8 consists of the words can, could, may, should, to, will, would. The New Age canon then continues: “Using this estimate for the probability of a string and an aggregate model with C = 16 [i.e., 16 different word classes – rcb] trained on newspaper text… we find that… p(colorless green…)/p(furiously sleep…) ≈ 2 × 10⁵” (i.e., about 200,000 times greater). In other words, roughly speaking, the part-of-speech sequence Adjective-Adjective-Noun-Verb-Adverb is that much more likely than the sequence Adverb-Verb-Noun-Adjective-Adjective.
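The aggregate bigram calculation can be sketched concretely. All numbers and class assignments below are made up for illustration – real aggregate models learn soft class memberships p(c|x) by EM, whereas this toy version assigns each word one hard class – but the arithmetic follows the quoted formula: smooth p(y|x) through classes as ∑c p(y|c) p(c|x).

```python
# Toy sketch of the aggregate bigram idea (Saul & Pereira style):
# p(y|x) ≈ sum over classes c of p(c|x) * p(y|c).
# Hard one-class-per-word memberships and invented counts, for illustration.

CLASS_OF = {"colorless": "Adj", "revolutionary": "Adj", "new": "Adj",
            "green": "Adj", "ideas": "Noun"}

def p_class_given_word(c: str, x: str) -> float:
    """p(c|x): with hard classes, 1 for the word's own class, else 0."""
    return 1.0 if CLASS_OF[x] == c else 0.0

# p(y|c): estimated from (invented) counts of word y following a class-c word.
FOLLOW_COUNTS = {"Adj": {"new": 3, "ideas": 5, "green": 1}}

def p_word_given_class(y: str, c: str) -> float:
    counts = FOLLOW_COUNTS[c]
    return counts.get(y, 0) / sum(counts.values())

def aggregate_bigram(y: str, x: str) -> float:
    """Class-smoothed p(y|x): nonzero even when the literal bigram is unseen."""
    return sum(p_class_given_word(c, x) * p_word_given_class(y, c)
               for c in FOLLOW_COUNTS)

# "colorless green" was never observed, but colorless is an Adjective
# and green does follow Adjectives, so the estimate is nonzero:
print(aggregate_bigram("green", "colorless"))  # 1/9 ≈ 0.111
```

Note that this is, step for step, the LSLT recipe: the zero count for colorless green is replaced by a count over the class pair Adjective–green, parasitic on observed strings like revolutionary new ideas.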
So what hath stats wrought? Two numbers, yes. But a revolutionary new idea? Not so much. All the numbers serve up is that I’m 200,000 times more likely to say colorless green ideas… than furiously sleep ideas… – a statistical summary of my external behavior. But that’s it, and it’s not nearly enough. This doesn’t really explain why people bore full-steam ahead on (1) and assign it right-as-rain syntactic structure, pronounced just like revolutionary new ideas and just as memorable, with (2) left hanging as a limp list of words. The likelihood gap doesn’t – can’t – match the grammaticality gap. As a previous post put it, that’s just not the game we’re playing: “the rejection of the idea that linguistic competence is just (a possibly fancy statistical) summary of behaviors should be recognized as the linguistic version of the general Rationalist endorsement of the distinction between powers/natures/capacities and their behavioral/phenomenal effects.” UG’s not a theory about statistically driven language regularities but about capacities. Nobody doubts that stats have some role to play in the (complex but murky) way that UG and knowledge of language and god knows what else interact so that the chance of my uttering carminative fulvose aglets murate ascarpatically works out to near zero, while for David Foster Wallace, that chance jumps by leaps and bounds. Certainly not SS, which in the course of describing (1) and (2) explicitly endorses statistical methods as a way to model human linguistic behavior (SS fn. 18, p. 17). But I don’t give an apotropaic hoot about modeling the actual words coming out of my mouth. Rather, I want to explain what underlies my capacities.