The talk on the street is that linguists should stuff their portfolios with probabilistic grammars as they have intrinsic advantages over the old, stodgy under-performing non-probabilistic grammars of yore. Partisans assure us that going probabilistic makes the learning problem easier, reduces the need for rich innate structures, and even lowers cholesterol levels. Though I am genetically inclined to skepticism (especially about things that sound too good to be true), debunking these claims has always required talents above my pay grade. Lucky for me I know people who understand these issues deeply and are able to translate their significance for the unwashed, i.e. me. This is all preamble for the following post by Bob Berwick, someone I have been able to persuade to post on an occasional basis (hopefully not too occasionally), especially on technical topics from computational linguistics. The bottom line is that ideas that are too good to be true aren't, though their intellectual half-lives can be very very long, as Bob explains below. Btw, I stole the title for this post from Bob's apposite reference to John Quiggin, an Australian economist who has done yeoman's work debunking hard-to-kill bad ideas in macroeconomics.
Going off the Gold Standard
Robert C. Berwick
You don't have to wander very far along
the computational linguistics byways these days to stumble across what
economist John Quiggin dubs “zombie ideas” – bad memes like “supply-side
economics after the Clinton boom and the Bush bust – repeatedly refuted with
evidence and analysis” yet somehow always rising from the dead. Among these, two stand out, virtually
articles of probabilistic faith: One, that Gold’s (1967) celebrated results
about the non-learnability of most language families given positive-only
evidence become moot, once one “goes probabilistic.” And two, that “going probabilistic” can be cashed out simply by adding probabilities
to context-free rules, rendering probabilistic context-free grammars easier to
learn than their context-free counterparts.
Easy enough, in fact, to drive a stake through the heart of Demon Rationalism
(aka, prior constraints on grammars).
But is any of this true? Or is it
instead, to quote the now-immortal words of Megyn Kelly on election night, just
“math you do as a Republican to make yourself feel better”?
Well, Yes and
No. Yes, in that nearly every thoughtful
scholar of language learnability, starting just a few years post Gold-rush,
bound up in the foundational work of Jay Horning, Jerry Feldman, and Ken Wexler,
in the late 1960s, and up to the present day, has stressed that the tough Gold
standards for learning – exact
identification of languages on all possible
positive texts – are cognitively implausible and overly restrictive, so should
be relaxed in favor of what's called “probabilistic example presentation” and
“probabilistic convergence.” That was,
and still is, the consensus view. In
other words, from the very beginning of serious research into language learnability
for natural languages, the probabilistic view has stood front and center. All agree we’re after a learning standard
that ensures much more: feasibility
(Chomsky's words, 1965, pp. 53-54), or what Ken Wexler (1980) termed “easy
learnability”, that is, “learnability from fairly restricted primary data, in a
sufficiently quick time, with limited use of memory” (p. 18).[1][2]
In probabilistic
presentation, a learner gets sentences drawn from some underlying distribution,
and only has to succeed on positive example sequences that it gets with
sufficient probability, rather than on all positive example sequences, as with
Gold. And in fact, there’s a technical sense in which this move does
the trick. When learnability on all
texts – the Gold standard – is relaxed in favor of what Ken Wexler was first to
call “measure-1 learnability” at a Stanford workshop in September, 1970, then,
unsurprisingly, moving the goalposts changes the game. What perhaps is surprising is that this change seemingly moves the goalposts almost
to the line of scrimmage: now, instead of no
interesting language classes being learnable, all interesting language classes are learnable (officially, all
recursively enumerable language families).
Why?
Picture sentences being spit out one at a time by some underlying
stochastic source (aka, the “I blame the parents” model), so every sentence
occurs with some probability. (Assuming the draws are independent and identically
distributed – what’s technically called “i.i.d.” – from a fixed stochastic
source.) In that case, a learner cannot
count on using some vanishingly rare example sentence as a red flag for their
correct target language. So, we don’t want to insist, as Gold did, that the
learner must succeed on all
texts. If figuring out that your
language is English means having to wait for the one David Foster Wallace
sentence that begins “Uninitiated adults who might...” and trundles to a close
90 words later with, “subdued, almost narcotized-looking,” then you know your
goose is cooked. However, if a learner
knows that all longer and longer
sentences always become less and less
likely, then it can know with high confidence that e.g., the first 100
sentences are more than enough to sort out what particular target language your
source has been pitching.
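Here is a minimal sketch of the assumption at work in that last step – not anything from Gold or Horning, just an illustration. It treats sentence length as geometrically distributed (an invented stand-in for "longer sentences are always less likely", with a made-up stopping parameter), and shows that a 90-word, Wallace-length sentence is astronomically rare, so a modest sample already contains nearly everything the learner will ever see.

```python
# A toy illustration (assumed geometric length distribution, invented parameter):
# if longer sentences are always exponentially less likely, very long sentences
# can be safely ignored by the learner.
import random

random.seed(0)
P_STOP = 0.2  # illustrative probability of ending the sentence after each word

def sample_length():
    """Sample one sentence length from a geometric distribution."""
    length = 1
    while random.random() > P_STOP:
        length += 1
    return length

lengths = [sample_length() for _ in range(100)]
print(max(lengths))          # typically well under 30 words
print((1 - P_STOP) ** 89)    # chance a sentence reaches word 90: roughly 2e-9
```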
And this is exactly what Horning (1969)
did. He attached probabilities to context-free
grammar rules, to get probabilistic context-free grammars (PCFGs). So for instance, if there are 2 rules in a
grammar expanding NPs, NP → Det N and NP → Pronoun, then we might assign a probability of 0.5 to each (the probabilities of the rules expanding a given nonterminal must sum to 1). Since every rule in a PCFG is assumed to be
statistically independent, the probability of any particular sentence is simply
the product of the probabilities of the rules deriving it. As sentences get longer and longer, their
probabilities must shrink – they become exponentially less likely. If the
learner knows this in advance, then it knows that very long sentences are so
unlikely that they can be effectively eliminated, with the problem reduced to
essentially that of learning to identify a family of finite languages. But we
already knew that finite language families are learnable from positive-only
examples.[3]
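The arithmetic behind that paragraph is easy to see in a toy PCFG. The grammar, rule probabilities, and example derivation below are invented purely for illustration: a derivation's probability is the product of the probabilities of the rules used, so every extra rule application multiplies in another factor below 1, and longer sentences become exponentially less likely.

```python
# A toy PCFG (rules and probabilities invented): derivation probability is the
# product of the probabilities of the rules used in the derivation.
from math import prod

# Each nonterminal maps to its expansions; probabilities per nonterminal sum to 1.
TOY_PCFG = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("Det", "N"): 0.5, ("Pronoun",): 0.5},
    "VP": {("V", "NP"): 0.4, ("V",): 0.6},
}

def derivation_probability(rules_used):
    """Multiply the probabilities of the (lhs, rhs) rules in a derivation."""
    return prod(TOY_PCFG[lhs][rhs] for lhs, rhs in rules_used)

# e.g. "He sees the dog": S -> NP VP, NP -> Pronoun, VP -> V NP, NP -> Det N
derivation = [("S", ("NP", "VP")), ("NP", ("Pronoun",)),
              ("VP", ("V", "NP")), ("NP", ("Det", "N"))]
print(derivation_probability(derivation))   # 1.0 * 0.5 * 0.4 * 0.5 = 0.1
```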
In fact, as
Horning (1969, 80-81) suggests, one need not rely on any conventional probability
distribution over context-free grammars and languages; one could substitute,
for example, a measure that ascribes exponentially lower probabilities to
larger grammars, and search for target grammars in order of increasing length,
or even a probability measure that just assigns the value 2^(–i) to the grammar enumerated at
step i, G_i (as discussed by Ken Wexler, 1980, fn. 22, 529-532;
this is one way to develop the notion of an evaluation metric, something we’ll consider
in later posts). More generally, any so-called
uniformly computable probability
measure works (as proved in Dan Osherson, Michael Stob, and Scott Weinstein’s
book, 1986, chapter 10 pp. 187-188, and by Angluin 1988).
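To make the enumerate-and-score idea concrete, here is a minimal sketch – emphatically not Horning's EVALUATER, and the candidate "grammars" and likelihood function are hypothetical stand-ins. The i-th grammar in the enumeration gets prior 2^(–i), and the learner keeps whichever candidate maximizes prior times the likelihood it assigns to the observed sample.

```python
# A sketch of an enumerative, evaluation-metric-style learner (assumed prior
# 2**(-i) over the enumeration; grammars and likelihoods are placeholders).

def enumerative_learner(grammars, likelihood, sample):
    """Return the grammar maximizing prior(G_i) * likelihood(G_i, sample),
    where prior(G_i) = 2**(-i) for the i-th grammar in the enumeration."""
    best, best_score = None, 0.0
    for i, g in enumerate(grammars, start=1):
        score = 2.0 ** (-i) * likelihood(g, sample)
        if score > best_score:
            best, best_score = g, score
    return best

# Toy usage: two stand-in grammars with made-up sample likelihoods.
candidates = ["G1 (small, poor fit)", "G2 (larger, better fit)"]
toy_likelihood = lambda g, s: {"G1 (small, poor fit)": 0.001,
                               "G2 (larger, better fit)": 0.01}[g]
print(enumerative_learner(candidates, toy_likelihood, sample=["a b", "a a b"]))
# prints "G2 (larger, better fit)": 0.25 * 0.01 beats 0.5 * 0.001
```

Since the priors 2^(–i) sum to at most 1 over the whole enumeration, this is one crude way of cashing out an evaluation metric: earlier (simpler) grammars are preferred unless the data pull strongly the other way.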
So, are we done?
Does adding probabilities to context-free rules really make them easier to
learn? No, there's a catch. TANSTAAFL
(viz. There ain’t no such thing as a free lunch!). The measure-1 result holds scant cognitive interest (though it can be
said to point the way to what constraints we really do need for human language
learning models). Why? The positive
results for measure-1 learnability only work if the learner has extremely strong prior knowledge about the probability distribution that's feeding
it sentences. If you really want the
learner’s tabula to be a little more rasa, then as the late Partha Niyogi put
the matter, “when one considers statistical learning...all statistical
estimation algorithms worth their salt are required to converge in a
distribution-free sense” (Niyogi, 2005, p. 67).
That is to say, suppose the learner doesn’t have any pre-conceived
(and accurate) notion of how the example sentences will be distributed. Problem
is, Niyogi then shows, as per Angluin (1988), that distribution-free
statistical learning collapses back to Gold learning: if a family of languages
is measure-1 learnable in a distribution-free sense, then it is Gold
learnable. Measure-1 learnability
doesn't enlarge the class of learnable languages one jot. Oops.
As if that weren’t
enough, the assumptions blow away any remaining cognitive fidelity we thought
we had won. For one thing, the
statistical independence of PCFG rules is precisely what has been explicitly
and strongly rejected by the same
folks who embrace Horning, in the recent flood of parsers statistically trained
in a supervised way from pre-parsed corpus data. Rather, such rules are plainly dependent:
once you decide a sentence is NP VP, then it’s far more likely that the NP,
being a subject, will be expanded as a pronoun, with the reverse true if the
NP’s an object – which is just to say that PCFG rules aren’t really
statistically independent. The
assumption that sentences are pitched to the learner in i.i.d fashion is plainly
wrong. Nobody believes parents speak this
way.
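The dependence point is easy to illustrate with counts – the toy numbers below are invented, not from any corpus. Conditioned on grammatical position, the relative frequency with which NP expands to a pronoun differs sharply, so no single position-independent probability for NP → Pronoun can be right for both subjects and objects.

```python
# Invented toy counts illustrating rule dependence: P(NP -> Pronoun) differs
# by grammatical position, which a position-independent PCFG rule probability
# cannot capture.
from collections import Counter

observations = [                       # (position, NP expansion) pairs
    ("subject", "Pronoun"), ("subject", "Pronoun"), ("subject", "Pronoun"),
    ("subject", "Det N"),
    ("object", "Det N"), ("object", "Det N"), ("object", "Det N"),
    ("object", "Pronoun"),
]

counts = Counter(observations)
for position in ("subject", "object"):
    total = sum(n for (pos, _), n in counts.items() if pos == position)
    p_pronoun = counts[(position, "Pronoun")] / total
    print(position, "P(NP -> Pronoun) =", p_pronoun)
# subject P(NP -> Pronoun) = 0.75
# object P(NP -> Pronoun) = 0.25
```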
Then too,
Horning’s method for selecting grammars – explicit enumeration along with search
using Bayes’ rule to hunt for the highest posterior probability grammar given
the examples, sometimes collapsing phrase names together and sometimes
splitting them apart – proved to be computationally intractable. (It’s also not a whit different conceptually
from what’s been advanced by contemporary Bayesian enthusiasts as far as anyone
can tell.) Horning himself remarks on how slow his “550 line Lisp
program,” “EVALUATER”, ran, in part
because “the enumerative problem is immense” (pp. 199-120). It could only
(partly) find grammars with 2 or 3 nonterminals. To be sure, computation speed
has increased by leaps and bounds over the past 35 years from Lisp/360 running
on an IBM 360/67. And computer scientists have come up with faster methods to
solve Bayesian problems, often approximate – ‘variational Bayes’ and the like. Nevertheless, the immense space of possible
PCFGs – larger than the space of CFGs – confounds search methods to this day.
So in truth, Horning’s
enumerative method doesn’t nearly go far enough. PCFGs don’t turn out to be any
“easier to learn” than their non-probabilistic counterparts – perhaps what’s meant here is that it’s more straightforward to use known techniques
to estimate PCFG probabilities. However,
that’s not the same thing as learning the underlying rules themselves. In fact, estimating language distributions is harder in principle
than estimating languages simpliciter,
as Niyogi (2005, p.68) observes. What’s
needed, rather, is far, far stronger medicine, something along the lines of
Ken Wexler's “bounded degree of error” approach, in which the possible mistakes
a learner can detect with respect to sentences of at most 2 embeddings span
the entire space of all possible errors
across a complete family of transformational grammars. Crucially, Ken’s approach is not enumerative. Instead, it carries out a search in a
tightly-bounded, finite space, with each learning step incrementally improving
a single transformational component, rather than selecting and throwing away
wholesale entire grammars. But to get this to work, Wexler found out that one
must impose far stronger a priori
constraints on the family of grammars or languages.
In our internet
era where it sometimes feels as though a veil of ignorance has been drawn
across any work that was accomplished before 1990, it is well worth remembering
that by that September 1970 workshop held at Stanford, Wexler and colleagues
had already set up this model for learning transformational generative grammar
where both presentation and convergence were probabilistic, in just the sense
described by modern day probability enthusiasts – and found this was not nearly
enough. One had to include constraints
on grammars, constraints that turned out to be empirically attested, like locality
constraints on how far one could displace phrases. That work, establishing a positive learnability
result, still stands – though rarely cited by statistical enthusiasts despite
its probabilistic underpinnings. In fact,
it has apparently even been revived just this past year, in work by Mark
Steedman and colleagues (2012), though without any apparent tip of the hat in
Ken’s direction. Mark uses combinatory
categorial grammar instead of transformational grammar, but in many other ways
follows Ken's model, with the learner using (meaning, surface string)
pairs. Plus ça change.
Next: Why
revolutionary new ideas do appear infrequently.
[1]A partial list of people embracing this
view would include Jay Horning (1969); Henry Hamburger and Ken Wexler (1970,
1973, 1975); Bob Matthews (1979); Peter Culicover and Ken (1980); yours truly
(1982/1985); Dana Angluin (1988); and Partha Niyogi (2005), among others. Here's
Horning, 1969, 19, fn. 1, “In the sequel, when we prove identifiability in the
limit with text presentation, it is with a different performance requirement,
and a different condition on the presentation.”
[2]Here is the quote from Wexler's
presentation at a September, 1970 workshop held at Stanford (p. 159): “A language
class is identifiable in the limit with probability 1 with respect to a
probabilistic presentation scheme if there exists a learning procedure such
that for any member of the class there is a subset of measure 1, of the set of presentation
sequences for which the procedure identifies in the limit the language."
[3]In fact, Horning, in what seems to have
gone largely unnoticed by statistical enthusiasts, stresses several times that
his results hold only for the case of unambiguous context-free grammars: 1969:
32-33: “In the sequel we assume...that unambiguous grammars are desired, and will
reject grammars which make any sample string ambiguous.” Of course, on the assumption that natural
languages are ambiguous, this immediately makes his results unacceptable from
the standpoint of cognitive fidelity.
It’s hard to know just why this has been overlooked for 45 years. Putting aside the possibility that the people
citing Horning in fact haven't really read his thesis, luckily, this restriction
turns out to be of little importance: Horning adopted it simply because he didn’t know how to place a
properly-defined probability measure on the languages generated by ambiguous
context-free grammars, a matter that has since been remedied by Osherson, Stob,
and Weinstein (1986), who established a far more general result encompassing
Horning’s, as noted above.
Hi Bob/Norbert,
Interesting post on a topic dear to my heart.
Could you expand on why you think distribution free learning is an appropriate model in this case? And what you mean by it, because you seem to use it in two completely different senses.
I know Partha claimed this but it seems so obviously wrong that I think I must be misunderstanding his claim, given that he knew his beans. Though he didn't actually provide an argument for his position.
And what has this got to do with learning PCFGs? where the distributions are generated by PCFGs (duh!) and which is therefore not distribution free.
Hi Alex, that's a fair point. Partha's comment about distribution-free learning is meant to indicate that if one goes entirely tabula rasa -- so that the learner knows nothing whatsoever about what the distribution of sentences might be -- then measure-1 learnability collapses to Gold learnability. If one is willing to ascribe prior knowledge to learners by way of assumptions regarding distributions, then one can purchase more success. Among others, Angluin doesn't seem to find this too opprobrious. However, as the blog said, the measure-1 learnability result (and Horning's much narrower version of it), while technically justifiable, results in models that are so unrealistic that they are only of mathematical, not empirical, interest. And that's Dan Osherson's view. Presentation via stochastically chosen members of a language, uniformly computable AND with sentences selected in i.i.d fashion, doesn't accord with what we know about the empirical facts regarding input to children. Like I said, nobody talks this way. So that includes as a subcase PCFGs. Now, if there is a learnability result for PCFGs that works while sticking more closely to empirically justified rather than mathematically justified assumptions, then that would be a real advance -- but that's not what's usually alluded to when people mention Horning and PCFGs in the same breath. For what it's worth, my own take looking backwards at 40+ years of Gold and formal learnability results is: where's the beef? That is, what has this approach really told us about the nature of human language and children? And there, despite some salutary effects for cognitive science generally about how to think a bit more clearly, for me the bottom-line is: not much. But others may well have a different view.
ReplyDelete"Presentation via stochastically chosen members of a language, uniformly computable AND with sentences selected in i.i.d fashion, doesn't accord with what we know about the empirical facts regarding input to children."
So certainly the iid assumption is false -- but you can replace it with something weaker if you like (ergodic and rapidly mixing). I don't think that's a big deal -- it's an idealisation of a classic type.
But the computable distribution thing seems reasonable. I mean, the distributions don't come out of thin air but represent the linguistic behaviour of adult speakers. So assuming that humans are computable (which again is fairly standard) aren't the distributions going to be computable? And even if you have some noncomputable randomness in the environment, Angluin's results are a bit more general than just uniformly computable distributions.
In any event we can look at the distributions of CDS in Childes -- and see for example, that the distribution of lengths does tail off exponentially.
So this is where I think it radically differs from the normal situation -- CDS is generated by native speakers, not by the rustling of leaves or a random number generator. So the distribution is going to have some properties that depend on the I-language: obviously PCFGs are way too simple as a model here, but some model will presumably be ok, namely the one where we view the whole brain as a probabilistic Turing machine.
Maybe I am missing the point -- but which 'empirical facts' are you talking about? Other than iid.
I kind of feel there are two different arguments here -- one is about whether PCFGs can be learned, and the second is whether PCFGs are a good model of the distribution of sentences.
Angluin shows that the answer to the first is yes even if the learner has the weakest possible prior knowledge (apart from the HUGE issue of computational complexity). The second one is clearly no, since CFGs aren't an adequate model of syntax.
I agree that the measure 1 learnability is irrelevant, for pretty much the same reasons that the Gold model is. I.e. the class of distributions is way too big just like the class of presentations in the Gold model is way too big.
But let me ask one question -- you say at the beginning that you favour probabilistic models of learning, and that you reject the Gold models, but you reject the iid assumption and you reject limits on the classes of distributions and so on. So what model do you favour? We are going to have to idealise in some way to make progress -- so being positive what sorts of probabilistic models do you favour?
Being constructive, rather than banging on about the indubitable limitations of Horning's paper, there is a recent paper in Computational Linguistics, "Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning" by Noah Smith and Shay Cohen, that discusses the learnability issues of PCFGs in detail. Do you accept the conclusions of that paper? Do you think that learning model is reasonable?
By the way, I completely agree with your comments about computational complexity -- I think this is the really crucial problem; completely neglected by Bayesians in general.
Yes but the Bayesians do seem to me to have shown that there is no real need for certain kinds of principles and parameters that don't seem to have worked out very well. For example my 'Morphological Blocking Constraint' from 1990, which is supposed to explain why in Irish, 'analytic' forms consisting of a non-person-marked verb followed by a pronoun don't usually freely alternate with 'synthetic' ones (with a person-marked form) when both are available (for reasons discussed in the article, movement rules don't do a very good job with this data).
ReplyDeleteBut the principle has various problems, some of the synthetic forms do alternate, some of them are 'less common' etc, &, for a Bayesian learner, the best analysis would be the one that correctly mirrored the statistics of the community, even if it required some stipulative constraints in the morphology to supress synthetic forms in certain cases.
Another example would be that that-trace constraint, which was supposed to need some clever principles to cause it to exist without negative data, but none of the attempts to make it follow from anything seem to have worked out (Ash Asudeh's 2009 LFG conference paper).
I wasn't clear -- I am a fan of probabilistic reasoning in general and therefore of Bayesian models, I just think they neglect the computational issues. There is a lot of rather murky stuff going on in the details of the sampling algorithms they use to do inference in these models. In some cases this isn't a problem (Morphology, POS models and so on) -- in other cases, like PCFG induction there is a big problem.
As a descriptive syntactician, I naturally think that if you got the grammatical prior aka notation right, the computations to find the right grammar might be more doable (but the correct ones might still be too hard for our current technology). The Bayesian notion of 'fit' of data to grammar is a major upgrade over Chomsky's, at least in the minds of the average working syntactician, to whatever extent it may have been present in earlier computational work.
I do not understand why people are still so fixated on PCFGs as anything other than an elementary example, since they are hopelessly lame in so many obviously important ways.
The richer formalisms, like MCFGs/LCFRSs/MGs and so on, are just a lot harder to work with on a practical level. The parsing algorithms are more complex, there are issues with training models. I have seen a few recent papers though using formalisms that fix the linguistic problems of PCFGs -- a paper by Laura Kallmeyer on Data-Driven parsing with LCFRSs (on German) and a paper by Jenny Finkel on Efficient, Feature-based, Conditional Random Field Parsing, which uses a feature based representation. But there are technical problems which are hard to overcome.
From a theoretical point of view, all of the probabilistic variants of these formalisms are learnable in the sense that PCFGs are learnable above --- i.e. if you neglect the computational bounds.
I think there have been *huge* technical advances in our understanding of mildly context sensitive formalisms in the last 5 years. This will feed into more interesting models in the near future -- viz Ed Stabler's work on parsing with MGs.