Size Matters: If the Shoe Fits
Everyone’s taught in Linguistics 101 that
Small is Beautiful: At the end of the day, if you wind up writing one grammar
rule per example, your grammatical theory amounts to zip – simply a list of the
examples, carrying zero explanatory weight. Why zero? Because one could swap out any part
of the data for something entirely different, even random examples from, say, Martian, and wind up with a rule set just
as long. Computer scientists immediately
recognize this for what it is: Compression. And most linguists know the classic
illustration – English auxiliary verb sequences, from Logical Structure of Linguistic Theory (1955/1975) to Syntactic Structures (1957) and then to Aspects (1965), buffed to high polish in Howard Lasnik’s Syntactic Structures Revisited (2000),
chapter 1. Forgetting morphological
affixes for the moment, there are 8 possible auxiliary verb sequences – 1 with zero
aux verbs; 3 with 1 aux verb (some form of modal,
have, or be); 3 with two aux
verbs, and one with all 3 aux verbs, so 8 possible aux patterns in all. Assuming
1 rule per pattern, we wind up with 8 rules, a description as long as the original
data we started with. Worse, 8 rules could describe any sequence of three binary-valued elements (modal or not, have or
not, be or not)[1].
However, if we assume that our representational machinery includes parentheses
that can denote optionality, we can boil the 8 rules down to just one: Aux → (Modal)(Have)(Be). In this way
Compression = generalization. But, Aspects
notes, whether we’ve got parentheses-as-optionality at our disposal or not is
entirely an empirical issue. To again draw from the classic conceit, if we were
instead Martians, and Martian brains couldn’t use parentheses but only something
like ‘cyclic permutation’, then the Martians couldn’t compress the English auxiliary
verb sequences the same way – what Martians see as small or large depends on
what machinery they’ve got in their heads.[2]
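For concreteness, here is a toy sketch of my own (nothing from Aspects or Lasnik, just a check on the arithmetic): enumerating the three optional slots gives the 1-3-3-1 grouping, and a single rule with all-optional elements generates exactly that set of eight.

```python
from itertools import product
from collections import Counter

# Each auxiliary slot -- Modal, Have, Be -- is independently present or absent,
# so there are 2 x 2 x 2 = 8 possible sequences.
SLOTS = ("Modal", "Have", "Be")

def all_optional(rhs):
    """Sequences generated by a rule whose right-hand-side elements are all optional,
    i.e. Aux -> (Modal)(Have)(Be)."""
    return {tuple(e for e, keep in zip(rhs, keeps) if keep)
            for keeps in product([True, False], repeat=len(rhs))}

patterns = all_optional(SLOTS)
print(len(patterns))                      # 8 patterns in all
print(Counter(len(p) for p in patterns))  # one 0-aux, three 1-aux, three 2-aux, one 3-aux
```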
However, this example misses – or perhaps
tacitly puts aside – one important point: Size matters in another way. If your shoes fit, they fit for two reasons: they are not too small, but
they are also not too large. As Goldilocks said, they are just right. Which is simply to say that a good grammar
not only has to compress as much as
possible the data it’s trying to explain, wringing out the right
generalizations, but it also has to fit
the data as closely as possible. A grammar shouldn’t generate sentences (and
sentence structures) that aren’t in the data.
Otherwise, a very, very small grammar, with just 3 or 4 symbols, e.g., S
→ Σ*, would ‘cover’ any set of sentences, and we’d be done. But that would be
just like saying that Yao Ming’s size 18 basketball shoes fit everyone. Of course they do, but for most people they
are way too large. Likewise, this one rule
overgenerates wildly. The measure we want should reward both small grammars and
tightly fitting ones, in some balanced way. Surprisingly, something resembling
this appears among the earliest documents in contemporary generative grammar:[3]
“Given the fixed notation, the criteria of simplicity
governing the ordering of statements are as follows: that the shorter grammar
is the simpler, and that among equally short grammars, the simplest is that in
which the average length of derivation of sentences is least” (Chomsky, 1949;
1951, p. 3).
So what do we find here? The first part –
the shorter the grammar, the simpler (and better) – sounds pretty close to
“Small is Beautiful”. But what about the
second part, measuring how the grammar “fits” the data? That’s less clear. One
might argue that the “average derivation length of sentences” could serve as a
proxy for goodness of fit. This isn’t
entirely crazy if we can somehow identify derivation length with fit, but it
isn’t clear how to do this, at least from this passage.[4]
Then there’s the two-part aspect – first we collect the shortest grammars, and
only then do we check to find the ones that have the shortest derivation
lengths covering the data. So the two criteria aren’t balanced. Some
improvement is called for.
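To spell out the imbalance in my own notation (not Chomsky’s): the 1949/1951 criterion is lexicographic – grammar length first, derivation length only as a tie-breaker – whereas what we are after is a single objective that weighs both at once:

$$\text{lexicographic:}\quad \hat{G} \in \arg\min_{G \in \mathcal{G}^*} \bar{\ell}(G), \qquad \mathcal{G}^* = \arg\min_{G} |G|$$
$$\text{balanced:}\quad \hat{G} = \arg\min_{G} \big(\,|G| + \mathrm{fit}(G, D)\,\big)$$

where $|G|$ is the length of the grammar, $\bar{\ell}(G)$ the average derivation length, and $\mathrm{fit}(G, D)$ some measure of how tightly $G$ fits the data $D$.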
It was the late computer scientist Ray
Solomonoff (1956, 1960) who first figured out how to patch all this up –
ironically it seems, on hearing Chomsky talk about generative grammar at the
1956 IRE Symposium on Information Theory, where Chomsky first publicly
presented his “Three models for the description of language.” Ray realized that
probabilistic grammars would work much better than the arithmetic formulas he
had been using to study inductive inference.
The result was the birth of algorithmic information theory, which concerns
the shortest description length of any sort of representation – including
grammars – a quantity now called Kolmogorov complexity.[5]
The key insight is the modern
realization that information, computation, and representation are all
intertwined in an incestuous, deep ménage à trois: now that everyone has PCs, we
are all familiar with the fact that it takes a certain number of “bits” to write down or
encode, well, this blog, coding the letters a
through z as binary strings, the way
your computer does. But as soon as we talk about the number of bits required for
descriptions, we have already entered into the realm of Shannon’s information
theory, which considers exactly this sort of problem[6].
Probabilities are then necessary, since,
very informally, the amount of information associated with a description is
equal to the negative logarithm of its probability. Finally, since encoding and
decoding require some kind of algorithm, computation’s in the mix as well.
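In symbols (this is just standard Shannon coding, nothing specific to grammars): an event with probability $p(x)$ gets a code of length

$$\ell(x) \;=\; -\log_2 p(x)\ \text{bits}, \qquad\text{equivalently}\qquad p(x) \;=\; 2^{-\ell(x)},$$

so short codes and high probabilities are two sides of the same coin.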
What Solomonoff discovered is that we should
try to find the shortest description
of a language wrt some family of grammars, where crucially the description includes the description length of the
grammar for the language as well. This attempts to minimize two factors: (1)
the length in bits of the data (some finite subset of a language) as encoded
(generated) by that grammar[NH1] plus (2) the length in bits of the
description of the grammar itself (the algorithm to encode/decode the
language). That’s the two-part formulation we’re looking for (and now we can
more clearly see that “average derivation length,” assuming the exchangeability
of derivation length for probabilities, with longer derivations corresponding
to lower probability of generation, is in fact a kind of proxy for coding the
fit of a grammar to a language).[7]
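Written out in the now-standard two-part notation (mine, not Solomonoff’s original formulation), the object to minimize is

$$\hat{G} \;=\; \arg\min_{G}\Big[\, \underbrace{L(G)}_{\text{bits for the grammar}} \;+\; \underbrace{L(D \mid G)}_{\text{bits for the data, encoded by the grammar}} \,\Big],$$

with $L(D \mid G) = -\log_2 p(D \mid G)$ once the grammar assigns probabilities to what it generates.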
Now, if a dataset is truly random and
contains no patterns at all then its description will be about as long as the
data itself – there’s no grammar (or more generally no theory) that can describe
it more compactly than this. But if the data does contain regularities, like
the Aux verb examples, then we can wring out of the data lots of bits via a
grammar. The single Aux rule takes only 10 symbols to write down (9 on the
right-hand side of the rule, and Aux on the left-hand side), while listing all
the aux sequences explicitly takes 20 symbols.[8]
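A back-of-the-envelope check of those symbol counts (my own tally, ignoring the rewrite arrow, as footnote 8 notes):

```python
# Symbol counts for the two descriptions of the auxiliary system (arrows ignored).
single_rule = ["Aux", "(", "Modal", ")", "(", "Have", ")", "(", "Be", ")"]
print(len(single_rule))                 # 10: 1 left-hand-side symbol + 9 on the right

# Listing each of the 8 patterns as its own rule: 8 left-hand-side symbols plus
# the 0+1+1+1+2+2+2+3 = 12 auxiliary symbols spread over the right-hand sides.
explicit_listing = 8 + sum([0, 1, 1, 1, 2, 2, 2, 3])
print(explicit_listing)                 # 20
```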
In short, minimizing factor (2) aims for
the shortest grammar in terms of bits, but this must also take into account factor
(1), minimizing the length of the encoded data. So there is an explicit trade-off
between factors (1) and (2) that penalizes short but overly profligate grammars,
or, conversely, long and overly restrictive grammars. A grammar such as “S → Σ*”
is only 3 symbols long (where the alphabet includes all English words) and certainly
covers the Auxiliary verb examples, but each auxiliary verb example is again
generated via a separate, explicit expansion and the grammar wildly overgenerates.
While the size of this grammar plus the data it covers might be short for the
first one or two examples, after just a few more cases the single-rule grammar
we displayed earlier beats it hands down. Worse, since this grammar also
generates all the invalid aux
strings, it will also code them just the same – just as short – as valid strings, that is, in a relatively
small number of bits. And that signals overgeneralization. Similarly, an overly-specific
grammar will also lose out to the single-rule aux grammar. If the over-specific grammar generates exactly
the first four auxiliary verb sequences and no more, it would have to
explicitly list the remaining 4 examples, and the total length of these would
again exceed the length of the single-rule grammar in bits. In this sense,
minimizing description length imposes an “Occam’s Razor” principle that tries
for the grammar that best compresses the data[NH2] and
the grammar.
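Here is a toy version of that trade-off (entirely my own illustration, with unit cost per grammar symbol and the crude assumption that a grammar spreads its probability uniformly over whatever it can generate at this length):

```python
import math

def description_length(grammar_symbols, n_generable, n_observed=8):
    # Grammar cost: one unit per grammar symbol.
    # Data cost: -log2(1/n_generable) bits to pick out each observed pattern
    # from everything the grammar could have generated.
    return grammar_symbols + n_observed * math.log2(n_generable)

# The single optionality rule: 10 symbols, generates exactly the 8 patterns.
print(description_length(10, 8))        # 10 + 8*3 = 34

# "S -> Sigma*": only 3 symbols, but it spreads probability over a vast set --
# say a million strings of comparable length -- so each pattern is costly to
# encode, and the overgenerating grammar loses despite being tiny.
print(description_length(3, 10**6))     # 3 + 8*~19.9, roughly 162
```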
The Solomonoff-Kolmogorov metric can be
cashed out in several ways. One familiar
approach is the Bayesian approach to language learning, as used, e.g., originally
by Horning (1969). This in fact was
Solomonoff’s tack as well. It’s easy to see why. Given some data D and a
grammar G ranging over some family of
possible grammars, we try to find the most likely G given D, i.e., the G that maximizes p(G|D). Rewriting this using Bayes’ theorem, this is the same as trying
to find the grammar G in our family
of grammars that maximizes p(G)×p(D|G)/p(D). Since
p(D)
is fixed, this is the same as simply trying to find the G that maximizes the numerator, p(G)×p(D|G)
– the prior probability of G, with
smaller G’s corresponding to higher
prior probabilities, times the likelihood of the data given the grammar – how
well the data is fit by the grammar. Exactly what we wanted.
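In symbols (just restating the paragraph), with the grammar family fixed:

$$\hat{G} \;=\; \arg\max_{G}\, p(G \mid D) \;=\; \arg\max_{G}\, \frac{p(G)\,p(D \mid G)}{p(D)} \;=\; \arg\max_{G}\, p(G)\,p(D \mid G),$$

and taking $-\log_2$ of the last expression turns “maximize prior times likelihood” into “minimize grammar bits plus data bits” – the same two-part sum as before.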
The Solomonoff-Kolmogorov measures can
also be closely linked to the so-called Minimum
Description Length principle (MDL) of Rissanen (1983), an apparent
rediscovery of the Solomonoff-Kolmogorov approach, which also states that the
best theory is the one that minimizes the sum of the length of the description
of the theory in bits, plus the length of the data as encoded by the theory.[9]
So does this shoe fit? How has this notion fared in computational linguistics?
The real problem for all description length methods is that while they provide
a reasonable objective function as a goal, they don’t tell you how to search
the space of possible hypotheses. In general, the Solomonoff-Kolmogorov-MDL approach
formally establishes a familiar trade-off between rules and memorized patterns:
if any sequence like kick the bucket
appears frequently enough, then it saves many bits to record it via (generative)
rules, rather than simply listing it verbatim in memory. And the more often a collocation like this
appears, the more savings accrue. Such methods have been adopted dozens of times
over the past 50-odd years, often directly in models of language acquisition, and
often in a Bayesian form (see, e.g., Olivier, 1968; Wolff, 1977, 1980, 1982;
Berwick, 1982, 1985; Ellison, 1991; Brent and Cartwright, 1993; Rissanen and
Ristad, 1994; Chen, 1995; Stolcke, 1994; Dowman, 1997; Villavicencio, 2002;
Briscoe, 2005; Hsu & Chater, 2010; Hsu et al., 2011; and several others). Most recently, MDL has been revived in an
especially insightful way by Rasin and Katzir, 2014, “On evaluation metrics in
OT.” At least to my mind, Rasin and Katzir’s MDL approach makes a good deal of
sense for evaluation metrics generally, and is well worth considering seriously
as the 50th anniversary of Aspects
approaches. It’s available here: http://ling.auf.net/lingbuzz/001934.
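To make the “kick the bucket” trade-off above concrete, here is a hypothetical back-of-the-envelope calculation (my own toy numbers, not drawn from any of the cited models), assuming a 10,000-word vocabulary so that each word token costs about log2(10,000) ≈ 13.3 bits:

```python
import math

WORD_BITS = math.log2(10_000)     # cost of encoding one word token
ENTRY_BITS = 3 * WORD_BITS        # one-time cost of listing "kick the bucket" in the lexicon
POINTER_BITS = WORD_BITS          # each later use is a single pointer to that entry

def total_bits(n_occurrences, memorized):
    if memorized:
        return ENTRY_BITS + n_occurrences * POINTER_BITS
    return n_occurrences * 3 * WORD_BITS  # otherwise, spell out all three words every time

for n in (1, 2, 5, 20):
    print(n, round(total_bits(n, False)), round(total_bits(n, True)))
# By the second occurrence the memorized chunk is already cheaper, and the
# savings keep growing with frequency.
```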
One blog post clearly can’t do justice to
all this work, but, for my money, the MDL high-water mark so far remains Carl de
Marcken’s seminal Unsupervised Language
Acquisition (1996 doctoral dissertation, MIT), available here:
http://www.demarcken.org/carl/papers/PhD.pdf; and summarized (and extended) in his 1996
ACL paper, Linguistic Structure as
Composition and Perturbation, here: http://www.demarcken.org/carl/papers/acl96.pdf. Starting out with only positive-example
strings of phonemes pseudo-transcribed from speech (so in particular, no
‘pauses’ between words), Carl’s unsupervised
MDL method crunches through the Wall Street Journal and Brown corpuses,
building generatively stored hierarchical
“chunks” at several linguistic levels and “compiling out” common morphological,
syntactic, and even syntactic-semantic patterns compositionally for reuse,
e.g., from “to take off” as [vp
<v><p><np>]◦[v take]◦[p off]◦[np ], all the way to “cranberry
{RED BERRY}” as c◦r◦a◦n◦berry{BERRY+RED} (here the open circle ◦
is a composition operator that’s just string concatenation in the simplest case,
but hierarchical composition for syntax, i.e., merge). As Carl notes, “no previous unsupervised language-learning
procedure has produced structures that match so closely with linguistic
intuitions.” So, just about 20 years
ago, we already had a learning method based on Solomonoff-type induction that
really worked – on full-scale corpora too.
What happened? Carl’s approach
was never really surpassed. A few people (most notably John Goldsmith) embraced
the MDL idea for their own work and a few more picked up on Carl’s
compositional & probabilistic hierarchical MDL-“chunks” approach. But for
the most part it seems as though the AI and computer science & learning
community fell into its usual bad habit of failing to stand on the shoulders of
past success – hard to tell whether one should chalk this up to a familiar need
to always appear sui generis, or to
ensure that the MDL wheel would be faithfully rediscovered every generation to
the same great surprise, or to just plain scholarly laziness. Whatever. Here’s
hoping it won’t be another 20 years before we discover MDL yet again. It
remains to be seen whether we can cobble together a better-fitting shoe.[NH3]
[1]Actually ternary if we
assume that any of the 3 aux forms can go in each position.
[2]But maybe not so much. One might rightly ask whether there could be
a universal representation or coding
method that could cover any
computational agent’s representational powers – Martian or otherwise. The
answer is, “Yes, but.” Solomonoff (1960) was the first to show that a universal prior does exist – but that it
is noncomputable. Yet this too can be fixed. For additional discussion, see
Berwick, 1982; 1985; and Li & Vitányi, 2008.
[3]Contemporary because, as
Kiparsky (2007) points out, the Pāṇini grammarians (400 BCE) also developed a
simplicity principle for their version of generative grammar based on the idea
that a rule should only be posited when it “saves” more feature specifications
than it “costs” to memorize a form literally. A special thanks to Ezer Rasin
for pointing me to Kiparsky’s recent lecture about this, here:
http://web.stanford.edu/~kiparsky/Papers/paris.pdf
[4]But this can be done;
exactly this measure has been used by computer science theorists; see, e.g.,
Ginsberg & Lynch, 1977. Derivation complexity in context-free grammar
forms. SIAM J. Computing, 6:1,
123-138. I talk about this a bit more in
my thesis (Berwick, 1982; 1985).
[5]The Russian
mathematician Kolmogorov independently arrived at the same formulation in the
early 1960s, as did Chaitin (1969). See Li and Vitányi, Introduction to Kolmogorov Complexity and its Applications,
Springer, 2008, for an excellent exposition of this field.
[6]A bit more carefully,
following Shannon’s account (1948), if there are n symbols in our alphabet and the grammar is m symbols long, then transmitting a “message” corresponding to our
grammar to somebody else would take about m log2 n bits.
[7]This is not as easy to
do as it might appear – it took Solomonoff several tries. See Li and Vitányi 2008, pp. 383-389 for
details about this.
[8]More properly, we
ignore the rewrite arrow and any details about the coding length for different
symbols; optimal coding schemes would alter this count slightly.
[9]Again, this is not as straightforward as it might appear; see Li & Vitányi,
pp. 383-389 for a discussion of the additional conditions that must be imposed
in order to ensure the interchangeability between Kolmogorov complexity and
MDL.
[NH1]I
assume that there are better and worse ways to do this as well? So more compact
descriptions of the data are better? Or is
this not so? Maybe an example wrt the aux system would help the illiterate like me.
[NH2]So
best compresses the PLD and the G?
[NH3]Really
nice. You might want to do a follow up showing how some of this worked in an
example or two. It’s really neat stuff. Excellent.
I had one of those "Two legs good, four legs bad" moments when I read this post.
So I count myself as a fan of de Marcken's work, with maybe more reservations than you, because I don't think it works that well, compared to standard Bayesian approaches,
and I am not 100% convinced by the small is good mantra (particularly given the success of instance based learning and ensemble methods).
But why is this work better/different from all of the other approaches? I would have thought that Norbert and Chomsky would hate this stuff.
Statistical analysis of raw corpora, no semantics in the input, no parameters in the model, an unrestricted hypothesis space, an empirical evaluation, and it can't learn parasitic gaps...
Why aren't you interested in more modern Bayesian work like the unsupervised part of
Cohn, Blunsom and Goldwater (here) from 2010?
Funny - I had the "did we read the same work?" moment when I read the comment. A lot was, um, compressed in Carl's work, highlighted more in his 'composition and perturbation' paper - the forerunner of the CBG 'tree substitution method' you refer to (though par for the course, CBG take no note of such prior art). Why better? Because it was the first, AFAIK, because it produced a better compression than gzip; because it could go all the way from textually represented phonemes (*not* "raw corpora", whatever that might mean), all the way to syntax and semantics (and so *did* use semantics in the input, viz., cranberry, {BERRY, RED}).
The entire point of the post was that the ‘small is beautiful’ mantra tells only half the story. Empirical evaluation – which to me means accounting for data – is part of the rest, and I’m sure Norbert subscribes to *that* – ask him. I’ll leave comparisons between MDL and Bayesian inference to the experts, Li and Vitányi. As for modernity – I’m all for modernity when it clarifies; the introduction of nonparametric methods and alternative prior schemes has proved to be a technological boost generally. But somehow I remain unimpressed by the gain so far, partly because the return to dependency grammars is, for me at least, a recidivism – stepping backwards and throwing away what we’ve learned about linguistic constraints. A topic for another day.
Norbert is fine with this stuff. It will all depend on what we count as the "data" and what predicates we use to measure size of G. In other words, I have nothing against evaluation measure theories per se. The argument against them was their feasibility, not their relevance. So, I can sign onto the view that the smallest theory that handles all the relevant data is the best (or at least an excellent candidate). Where Norbert has argued with Alex, and I suspect will continue to do so, is what is the relevant data, and what counts as accounting for it. These to one side, we can come to a kumbaya moment, though it will be of little interest given that we've abstracted from the central problem.
I was being a bit flippant but with a serious point -- I think the de Marcken style of doing things is a reasonable way of approaching the problem and I did some work along these lines back in the day. But it seems to be an example of exactly the sort of work that Norbert has often attacked. But let me channel my inner Norbert --
Delete"the central problem is to account for language acquisition given that what we know about generative grammar is correct. But the de Marcken approach doesn't account for any of the generalisations that GGers have discovered and it can't learn any of the syntactically interesting phenomena -- parasitic gaps, ECP effects etc etc. So it isn't even a starter. "
Actually, my recollection is that Carl had things to say about headedness and its virtues. This was what I really liked a lot. And still do.
If one is not comfortable with MDL as minimizing grammar size plus data description, there's a slightly different version that formulates the same principle in a way that might be even closer to what linguists often think they are doing: minimizing grammar size and the length of *exceptions* to the grammar. Li and Vitányi discuss this on pp. 397-398.
I wonder if another alternative presentation might not also be workable, and more appealing to certain kinds of people, which is that what is minimized is not size as such, but the rate of growth of size (of grammar+data specification) as the data rolls in. The motivation being that this feels 'dynamic' (more appealing in some quarters), but also might even be closer to correct, in that children, with their vast capacity to learn copious amounts of irregular morphology, seem very unlikely to me to care at all about grammar size as such, but might well notice how much work they have to do to accommodate new data in the system. For example, the child who has learned 'blunderbuss' and the plural rule only has to add pointers to the lexical item and the PL morpheme to specify the occurrence of 'blunderbusses' in the story she's hearing, while the one who hasn't will have to add a fully specified lexical item. (McCawley made the same suggestion w.r.t the original evaluation metric, iirc)