Tuesday, November 11, 2014

Size Matters

I post here for Bob Berwick.

Size Matters: If the Shoe Fits

Everyone’s taught in Linguistics 101 that Small is Beautiful: At the end of the day, if you wind up writing one grammar rule per example, your grammatical theory amounts to zip – simply a list of the examples, carrying zero explanatory weight.  Why zero? Because one could swap out any part of the data for something entirely different, even random examples from, say, Martian, and wind up with a rule set just as long.  Computer scientists immediately recognize this for what it is: compression. And most linguists know the classic illustration – English auxiliary verb sequences, from The Logical Structure of Linguistic Theory (1955/1975) to Syntactic Structures and then to Aspects (1965), buffed to a high polish in chapter 1 of Howard Lasnik’s Syntactic Structures Revisited (2000).  Setting aside morphological affixes for the moment, there are 8 possible auxiliary verb sequences – 1 with zero aux verbs; 3 with 1 aux verb (some form of modal, have, or be); 3 with 2 aux verbs; and 1 with all 3 aux verbs – 8 possible aux patterns in all. Assuming 1 rule per pattern, we wind up with 8 rules, a description as long as the original data we started with. Worse, 8 rules could describe any sequence of three binary-valued elements (modal or not, have or not, be or not)[1].  However, if we assume that our representational machinery includes parentheses that can denote optionality, we can boil the 8 rules down to just one:  Aux → (Modal)(Have)(Be). In this way compression = generalization. But, Aspects notes, whether we’ve got parentheses-as-optionality at our disposal or not is entirely an empirical issue. To draw again from the classic conceit, if we were instead Martians, and Martian brains couldn’t use parentheses but only something like ‘cyclic permutation’, then the Martians couldn’t compress the English auxiliary verb sequences the same way – what Martians see as small or large depends on what machinery they’ve got in their heads.[2]
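The counting can be checked mechanically: expanding the three optional slots of the single rule yields exactly the 8 patterns. A toy sketch (the slot names are just the category labels from the rule):

```python
from itertools import product

# Each slot in Aux -> (Modal)(Have)(Be) is optional: filled or empty.
SLOTS = ["Modal", "Have", "Be"]

# Every choice of filled/empty slots, in the fixed order Modal < Have < Be.
patterns = [
    tuple(slot for slot, present in zip(SLOTS, choice) if present)
    for choice in product([True, False], repeat=3)
]

for p in sorted(patterns, key=len):
    print(p or "(no aux)")

print(len(patterns))  # 2^3 = 8 patterns from one rule
```

One rule with three optional slots, 2 × 2 × 2 = 8 licensed sequences – the compression in miniature.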

However, this example misses – or perhaps tacitly puts aside – one important point: Size matters in another way.  If your shoes fit, they fit for two reasons: they are not too small, but they are also not too large.  As Goldilocks said, they are just right.  Which is simply to say that a good grammar not only has to compress as much as possible the data it’s trying to explain, wringing out the right generalizations, but it also has to fit the data as closely as possible. A grammar shouldn’t generate sentences (and sentence structures) that aren’t in the data.  Otherwise, a very, very small grammar, with just 3 or 4 symbols, e.g., S → Σ*, would ‘cover’ any set of sentences, and we’d be done. But that would be just like saying that Yao Ming’s size 18 basketball shoes fit everyone.  Of course they do, but for most people they are way too large.  Likewise, this one rule overgenerates wildly. The measure we want should reward both small grammars and tightly fitting ones, in some balanced way. Surprisingly, something resembling this appears among the earliest documents in contemporary generative grammar:[3]

“Given the fixed notation, the criteria of simplicity governing the ordering of statements are as follows: that the shorter grammar is the simpler, and that among equally short grammars, the simplest is that in which the average length of derivation of sentences is least” (Chomsky, 1949; 1951, p. 3).
So what do we find here? The first part – the shorter the grammar, the simpler (and better) – sounds pretty close to “Small is Beautiful”.  But what about the second part, measuring how the grammar “fits” the data? That’s less clear. One might argue that the “average derivation length of sentences” could serve as a proxy for goodness of fit.  This isn’t entirely crazy if we can identify longer derivations somehow with fit, but it isn’t clear how to do this, at least from this passage.[4] Then there’s the two-part aspect – first we collect the shortest grammars, and only then do we check to find the ones that have the shortest derivation lengths covering the data. So the two criteria aren’t balanced. Some improvement is called for.

It was the late computer scientist Ray Solomonoff (1956, 1960) who first figured out how to patch all this up – ironically, it seems, on hearing Chomsky talk about generative grammar at the 1956 IRE Symposium on Information Theory, where Chomsky first publicly presented his “Three models for the description of language.” Ray realized that probabilistic grammars would work much better than the arithmetic formulas he had been using to study inductive inference.  The result was the birth of algorithmic information theory, which concerns the shortest description length of any sort of representation – including grammars (now called Kolmogorov complexity).[5]  The key insight is the modern realization that information, computation, and representation are all intertwined in an incestuous, deep ménage à trois: now that everyone has PCs, we are all familiar with the fact that it takes a certain number of “bits” to write down or encode, well, this blog, coding the letters a through z as binary strings, the way your computer does. But as soon as we talk about the number of bits required for descriptions, we have already entered the realm of Shannon’s information theory, which considers exactly this sort of problem[6].  Probabilities are then necessary, since, very informally, the amount of information associated with a description is equal to the negative logarithm of its probability. Finally, since encoding and decoding require some kind of algorithm, computation’s in the mix as well.
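That relation between probability and bits is simple enough to write down directly (a trivial sketch):

```python
import math

# Very informally: the information carried by an event of probability p
# is -log2(p) bits -- rarer events take more bits to encode.
def info_bits(p: float) -> float:
    return -math.log2(p)

print(info_bits(0.5))   # a fair coin flip: 1.0 bit
print(info_bits(0.25))  # a 1-in-4 event: 2.0 bits
```

So a grammar that assigns higher probability to the observed sentences codes them in fewer bits – which is exactly why probabilities enter the story.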

What Solomonoff discovered is that we should try to find the shortest description of a language wrt some family of grammars, where crucially the description includes the description length of the grammar for the language as well. This attempts to minimize two factors: (1) the length in bits of the data (some finite subset of a language) as encoded (generated) by that grammar, plus (2) the length in bits of the description of the grammar itself (the algorithm to encode/decode the language). That’s the two-part formulation we’re looking for (and now we can more clearly see that “average derivation length,” assuming the exchangeability of derivation length for probabilities, with longer derivations corresponding to lower probability of generation, is in fact a kind of proxy for coding the fit of a grammar to a language).[7]  Now, if a dataset is truly random and contains no patterns at all, then its description will be about as long as the data itself – there’s no grammar (or more generally no theory) that can describe it more compactly than this. But if the data does contain regularities, like the aux verb examples, then we can wring lots of bits out of the data via a grammar. The single Aux rule takes only 10 symbols to write down (9 on the right-hand side of the rule, plus Aux on the left-hand side), while listing all the aux sequences explicitly takes 20 symbols.[8]

In short, minimizing factor (2) aims for the shortest grammar in terms of bits, but this must also take into account factor (1), minimizing the length of the encoded data. So there is an explicit trade-off between factors (1) and (2) that penalizes short but overly profligate grammars, or, conversely, long and overly restrictive grammars. A grammar such as “S → Σ*” is only 3 symbols long (where the alphabet includes all English words) and certainly covers the auxiliary verb examples, but each auxiliary verb example must again be generated via a separate, explicit expansion, and the grammar wildly overgenerates. While the size of this grammar plus the data it covers might be short for the first one or two examples, after just a few more cases the single-rule grammar we displayed earlier beats it hands down. Worse, since this grammar also generates all the invalid aux strings, it will code them just the same – just as short – as valid strings, that is, in a relatively small number of bits. And that signals overgeneralization. Similarly, an overly specific grammar will also lose out to the single-rule aux grammar.  If the over-specific grammar generates exactly the first four auxiliary verb sequences and no more, it would have to explicitly list the remaining 4 examples, and the total length of these would again exceed the length of the single-rule grammar in bits. In this sense, minimizing description length imposes an “Occam’s Razor” principle that seeks the grammar that best compresses the data and the grammar together.
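The trade-off can be made concrete with toy numbers. All the bit costs below are my own illustrative assumptions (5 bits per grammar symbol, a 1000-word vocabulary for the Σ* grammar, 8 equiprobable patterns for the aux grammar), not anything from the post:

```python
import math

SYMBOL_BITS = 5  # assumed cost to write one symbol of a grammar

def aux_grammar_cost(n_examples):
    # 10-symbol grammar; it licenses exactly 8 patterns,
    # so each observed pattern costs log2(8) = 3 bits.
    return 10 * SYMBOL_BITS + n_examples * math.log2(8)

def sigma_star_cost(n_examples, avg_len=2):
    # 3-symbol grammar, but each example must be spelled out word
    # by word over a 1000-word vocabulary: ~10 bits per word.
    return 3 * SYMBOL_BITS + n_examples * avg_len * math.log2(1000)

for n in (1, 5, 20):
    print(n, aux_grammar_cost(n), sigma_star_cost(n))
```

On these assumptions the tiny Σ* grammar wins on the very first example, but the tight single-rule grammar’s two-part total is already smaller by five examples and the gap only widens – the “few more cases” of the text.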

The Solomonoff-Kolmogorov metric can be cashed out in several ways.  One familiar approach is the Bayesian approach to language learning, as used, e.g., originally by Horning (1969).  This in fact was Solomonoff’s tack as well. It’s easy to see why. Given some data D and a grammar G ranging over some family of possible grammars, we try to find the most likely G given D, i.e., the G that maximizes p(G|D). Rewriting this using Bayes’ theorem, this is the same as trying to find the grammar G in our family of grammars that maximizes p(G)·p(D|G)/p(D).   Since p(D) is fixed, this is the same as simply trying to find the G that maximizes the numerator, p(G)·p(D|G) – the prior probability of G, with smaller G’s corresponding to higher prior probabilities, times the likelihood of the data given the grammar – how well the data is fit by the grammar. Exactly what we wanted.
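Here is a minimal sketch of that selection in code, with invented grammar sizes and per-sentence likelihoods chosen only to show the shape of the computation (the prior is p(G) = 2^(−|G| in bits), so smaller grammars get higher prior probability):

```python
import math

# Toy grammar family: name -> (size in bits, probability assigned
# to each observed sentence).  Both numbers are invented.
grammars = {
    "one-rule aux": (50, 1 / 8),      # tight: 8 licensed patterns
    "sigma-star":   (15, 1 / 10**6),  # tiny but spreads mass everywhere
}

def log_posterior(size_bits, p_sentence, n_sentences):
    # log2 p(G) + log2 p(D|G), up to the constant log2 p(D)
    return -size_bits + n_sentences * math.log2(p_sentence)

n = 10
best = max(grammars, key=lambda g: log_posterior(*grammars[g], n))
print(best)  # the tight grammar wins once data accumulates
```

Maximizing log p(G) + log p(D|G) is literally minimizing grammar bits plus data bits – the Bayesian and MDL formulations are two faces of the same coin.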

The Solomonoff-Kolmogorov measures can also be closely linked to the so-called Minimum Description Length principle (MDL) of Rissanen (1983), an apparent rediscovery of the Solomonoff-Kolmogorov approach, which also states that the best theory is the one that minimizes the sum of the length of the description of the theory in bits, plus the length of the data as encoded by the theory.[9]

So does this shoe fit?  How has this notion fared in computational linguistics? The real problem for all description length methods is that while they provide a reasonable objective function as a goal, they don’t tell you how to search the space of possible hypotheses. In general, the Solomonoff-Kolmogorov-MDL approach formally establishes a familiar trade-off between rules and memorized patterns: if any sequence like kick the bucket appears frequently enough, then it saves many bits to record it via (generative) rules, rather than simply listing it verbatim in memory.  And the more often a collocation like this appears, the more savings accrue. Such methods have been adopted dozens of times over the past 50-odd years, often directly in models of language acquisition, and often in a Bayesian form (see, e.g., Olivier, 1968; Wolff, 1977, 1980, 1982; Berwick, 1982, 1985; Ellison, 1991; Brent and Cartwright, 1993; Rissanen and Ristad, 1994; Chen, 1995; Stolcke, 1994; Dowman, 1997; Villavicencio, 2002; Briscoe, 2005; Hsu & Chater, 2010; Hsu et al., 2011; and several others).  Most recently, MDL has been revived in an especially insightful way by Rasin and Katzir, 2014, “On evaluation metrics in OT.” At least to my mind, Rasin and Katzir’s MDL approach makes a good deal of sense for evaluation metrics generally, and is well worth considering seriously as the 50th anniversary of Aspects approaches. It’s available here:

One blog post clearly can’t do justice to all this work, but, for my money, the MDL highwater mark so far remains Carl De Marcken’s seminal Unsupervised Language Acquisition (1996 doctoral dissertation, MIT), available here; it is summarized (and extended) in his 1996 ACL paper, Linguistic Structure as Composition and Perturbation, here. Starting out with only positive-example strings of phonemes pseudo-transcribed from speech (so in particular, no ‘pauses’ between words), Carl’s unsupervised MDL method crunches through the Wall Street Journal and Brown corpora, building generatively stored hierarchical “chunks” at several linguistic levels and “compiling out” common morphological, syntactic, and even syntactic-semantic patterns compositionally for reuse, e.g., from “to take off” as [vp <v><p><np>]◦[v take]◦[p off]◦[np ], all the way to “cranberry” {RED BERRY} as cranberry {BERRY+RED} (here the open circle ◦ is a composition operator that’s just string concatenation in the simplest case, but hierarchical composition for syntax, i.e., merge). As Carl notes, “no previous unsupervised language-learning procedure has produced structures that match so closely with linguistic intuitions.”  So, just about 20 years ago, we already had a learning method based on Solomonoff-type induction that really worked – on full-scale corpora too.  What happened?  Carl’s approach was never really surpassed. A few people (most notably John Goldsmith) embraced the MDL idea for their own work and a few more picked up on Carl’s compositional & probabilistic hierarchical MDL-“chunks” approach.
But for the most part it seems as though the AI and computer science & learning community fell into its usual bad habit of failing to stand on the shoulders of past success – hard to tell whether one should chalk this up to a familiar need to always appear sui generis, or to ensure that the MDL wheel would be faithfully rediscovered every generation to the same great surprise, or to just plain scholarly laziness. Whatever. Here’s hoping it won’t be another 20 years before we discover MDL yet again. It remains to be seen whether we can cobble together a better fitting shoe.

[1]Actually ternary if we assume that any of the 3 aux forms can go in each position.
[2]But maybe not so much.  One might rightly ask whether there could be a universal representation or coding method that could cover any computational agent’s representational powers – Martian or otherwise. The answer is, “Yes, but.”  Solomonoff  (1960) was the first to show that a universal prior does exist – but that it is noncomputable. Yet this too can be fixed. For additional discussion, see Berwick, 1982; 1985; and Li & Vitányi, 2008.
[3]Contemporary because, as Kiparsky (2007) points out, the Pāṇini grammarians (400 BCE) also developed a simplicity principle for their version of generative grammar based on the idea that a rule should only be posited when it “saves” more feature specifications than it “costs” to memorize a form literally. A special thanks to Ezer Rasin for pointing me to Kiparsky’s recent lecture about this, here:
[4]But this can be done; exactly this measure has been used by computer science theorists; see, e.g., Ginsberg & Lynch, 1977. Derivation complexity in context-free grammar forms. SIAM J. Computing, 6:1, 123-138.  I talk about this a bit more in my thesis (Berwick, 1982; 1985).
[5]The Russian mathematician Kolmogorov independently arrived at the same formulation in the early 1960s, as did Chaitin (1969). See Li and Vitányi, Introduction to Kolmogorov Complexity and its Applications, Springer, 2008, for an excellent exposition of this field.
[6]A bit more carefully, following Shannon’s account (1948): if there are n symbols in our alphabet and the grammar is m symbols long, then transmitting a “message” corresponding to our grammar to somebody else would take about m·log₂ n bits.
[7]This is not as easy to do as it might appear – it took Solomonoff several tries.  See Li and Vitányi 2008, pp. 383-389 for details about this.
[8]More properly, we ignore the rewrite arrow and any details about the coding length for different symbols; optimal coding schemes would alter this slightly.
[9]Again, this is not as straightforward as it sounds; see Li & Vitányi, pp. 383-389, for a discussion of the additional conditions that must be imposed in order to ensure the interchangeability between Kolmogorov complexity and MDL.



  1. I had one of those "Two legs good, four legs bad" moments when I read this post.

    So I count myself as a fan of de Marcken's work, with maybe more reservations than you, because I don't think it works that well, compared to standard Bayesian approaches,
    and I am not 100% convinced by the small is good mantra (particularly given the success of instance based learning and ensemble methods).
    But why is this work better/different from all of the other approaches? I would have thought that Norbert and Chomsky would hate this stuff.
    Statistical analysis of raw corpora, no semantics in the input, no parameters in the model, an unrestricted hypothesis space, an empirical evaluation, and it can't learn parasitic gaps...
    Why aren't you interested in more modern Bayesian work like the unsupervised part of
    Cohn, Blunsom and Goldwater (here) from 2010?

  2. Funny - I had the "did we read the same work?" moment when I read the comment. A lot was, um, compressed in Carl's work, highlighted more in his 'composition and perturbation' paper - the forerunner of the CBG 'tree substitution method' you refer to (though par for the course, CBG take no note of such prior art). Why better? Because it was the first, AFAIK, because it produced a better compression than gzip; because it could go all the way from textually represented phonemes (*not* "raw corpora", whatever that might mean), all the way to syntax and semantics (and so *did* use semantics in the input, viz., cranberry, {BERRY, RED}).

    The entire point of the post was that the 'small is beautiful mantra' tells only half the story. Empirical evaluation - which to me means accounting for data - is part of the rest, and I'm sure Norbert subscribes to *that* - ask him. I’ll leave comparisons between MDL and Bayesian inference for the experts Li and Vitanyi. As for modernity – I’m all for modernity when it clarifies; the introduction of nonparametric methods and alternative prior schemes has proved to be a technological boost generally. But somehow I remain unimpressed by the gain so far, partly because the return to dependency grammars, for me at least, a recidivism, stepping backwards and throwing away what we’ve learned about linguistic constraints. A topic for another day.

    The entire point of the post was that the 'small is beautiful' mantra tells only half the story. Empirical evaluation - which to me means accounting for data - is part of the rest, and I'm sure Norbert subscribes to *that* - ask him. I’ll leave comparisons between MDL and Bayesian inference to the experts, Li and Vitányi. As for modernity – I’m all for modernity when it clarifies; the introduction of nonparametric methods and alternative prior schemes has proved to be a technological boost generally. But somehow I remain unimpressed by the gain so far, partly because the return to dependency grammars is, for me at least, recidivism – stepping backwards and throwing away what we’ve learned about linguistic constraints. A topic for another day.
  3. Norbert is fine with this stuff. It will all depend on what we count as the "data" and what predicates we use to measure size of G. In other words, I have nothing against evaluation measure theories per se. The argument against them was their feasibility, not their relevance. So, I can sign onto the view that the smallest theory that handles all the relevant data is the best (or at least an excellent candidate). Where Norbert has argued with Alex, and I suspect will continue to do so, is what is the relevant data, and what counts as accounting for it. These to one side, we can come to a kumbaya moment, though it will be of little interest given that we've abstracted from the central problem.

    1. I was being a bit flippant but with a serious point -- I think the de Marcken style of doing things is a reasonable way of approaching the problem and I did some work along these lines back in the day. But it seems to be an example of exactly the sort of work that Norbert has often attacked. But let me channel my inner Norbert --

      "the central problem is to account for language acquisition given that what we know about generative grammar is correct. But the de Marcken approach doesn't account for any of the generalisations that GGers have discovered and it can't learn any of the syntactically interesting phenomena -- parasitic gaps, ECP effects etc etc. So it isn't even a starter. "

    2. Actually, my recollection is that Carl had things to say about headedness and its virtues. This was what I really liked a lot. And still do.

  4. If one is not comfortable with MDL as minimizing grammar size plus data description, there's a slightly different version that formulates the same principle in a way that might be even closer to what linguists often think they are doing: minimizing grammar size and the length of *exceptions* to the grammar. Li and Vitányi discuss this on pp. 397-398.

  5. I wonder if another alternative presentation might not also be workable, and more appealing to certain kinds of people, which is that what is minimized is not size as such, but the rate of growth of size (of grammar+data specification) as the data rolls in. The motivation being that this feels 'dynamic' (more appealing in some quarters), but also might even be closer to correct, in that children, with their vast capacity to learn copious amounts of irregular morphology, seem very unlikely to me to care at all about grammar size as such, but might well notice how much work they have to do to accommodate new data in the system. For example, the child who has learned 'blunderbuss' and the plural rule only has to add pointers to the lexical item and the PL morpheme to specify the occurrence of 'blunderbusses' in the story she's hearing, while the one who hasn't will have to add a fully specified lexical item. (McCawley made the same suggestion w.r.t. the original evaluation metric, iirc)