The talk on the street is that linguists should stuff their portfolios with probabilistic grammars as they have intrinsic advantages over the old, stogy under-performing non-probabilsitic grammars of yore. Partisans assure us that going probabilistic makes the learning problem easier, reduces the need for rich innate structures, and even lowers cholestoral levels. Though I am genetically inclined to skepticism (especially about things that sound too good to be true), debunking these claims has always required talents above my pay grade. Lucky for me I know people who understand these issues deeply and are able to translate their significance for the unwashed, i.e. me. This is all preamble for the following post by Bob Berwick, someone that I have been able to persuade to post on an occasional basis, (hopefully not too occasionally) especially on technical topics from computational linguistics. The bottom line is that ideas that are too good to be true aren't, though their intellectual half lives can be very very long, as Bob explains below. Btw, I stole the title for this post from Bob's apposite reference to John Quiggin, an Australian economist who has done yeoman's work debunking hard to kill bad ideas in macroeconomics.
Going off the Gold Standard
Robert C. Berwick
You don't have to wander very far along the computational linguistics byways these days to stumble across what economist John Quiggin dubs “zombie ideas” - bad memes like “supply-side economics after the Clinton boom and the Bush bust – repeatedly refuted with evidence and analysis" yet somehow always rising from the dead. Among these, two stand out, virtually articles of probabilistic faith: One, that Gold’s (1967) celebrated results about the non-learnability of most language families given positive-only evidence become moot, once one “goes probabilistic.” And two, that “going probabilistic” can be cashed out simply by adding probabilities to context-free rules, rendering probabilistic context-free grammars easier to learn than their context-free counterparts. Easy enough, in fact, to drive a stake through the heart of Demon Rationalism (aka, prior constraints on grammars). But is any of this true? Or is it instead, to quote the now-immortal words of Meygn Kelly on election night, just “math you do as a Republican to make yourself feel better”?
Well, Yes and No. Yes, in that nearly every thoughtful scholar of language learnability, starting just a few years post Gold-rush, bound up in the foundational work of Jay Horning, Jerry Feldman, and Ken Wexler, in the late 1960s, and up to the present day, has stressed that the tough Gold standards for learning exact identification of languages on all possible positive texts – are cognitively implausible and overly restrictive, so should be relaxed in favor of what's called “probabilistic example presentation” and “probabilistic convergence.” That was, and still is, the consensus view. In other words, from the very beginning of serious research into language learnability for natural languages, the probabilistic view has stood front and center. All agree we’re after a learning standard that ensures much more: feasibility. (Chomsky's words, 1965, pp. 53-54) or what Ken Wexler (1980) termed “easy learnability”, that is, “learnability from fairly restricted primary data, in a sufficiently quick time, with limited use of memory” (p. 18). 
In probabilistic presentation, a learner gets sentences drawn from some underlying distribution, and only has to succeed on positive example sequence that it gets with sufficient probability, rather than on all positive example sequences, as with Gold. And in fact, there’s a technical sense in which this move does the trick. When learnability on all texts – the Gold standard – is relaxed in favor of what Ken Wexler was first to call “measure-1 learnability” at a Stanford workshop in September, 1970, then, unsurprisingly, moving the goalposts changes the game. What perhaps is surprising is this change seemingly moves the goalposts almost to the line of scrimmage: now, instead of no interesting language classes being learnable, all interesting language classes are learnable (officially, all recursively enumerable language families).
Why? Picture sentences being spit out one at a time by some underlying stochastic source (aka, the “I blame the parents” model), so every sentence occurs with some probability. (Assuming independent draws – what’s technically called “i.i.d” – draws from an independent and identically distributed stochastic source). In that case, a learner cannot count on using some vanishingly rare example sentence as a red flag for their correct target language. So, we don’t want to insist, as Gold did, that the learner must succeed on all texts. If figuring out that your language is English means having to wait for the one David Foster Wallace sentence that begins “Uninitiated adults who might...” and trundles to a close 90 words later with, “subdued, almost narcotized-looking,” then you know you’re goose is cooked. However, if a learner knows that all longer and longer sentences always become less and less likely, then it can know with high confidence that e.g., the first 100 sentences are more than enough to sort out what particular target language your source has been pitching.
And this is exactly what Horning (1969) did. He attached probabilities to context-free grammar rules, to get probabilistic context-free grammars (PCFGs). So for instance, if there are 2 rules in a grammar expanding NPs, NP → Det N and NP → Pronoun, then we have a probability of 0.5 for each. Since every rule in a PCFG is assumed to be statistically independent, the probability of any particular sentence is simply the product of the probabilities of the rules deriving it. As sentences get longer and longer, their probabilities must shrink – they become exponentially less likely. If the learner knows this in advance, then it knows that very long sentences are so unlikely that they can be effectively eliminated, with the problem reduced to essentially that of learning to identify a family of finite languages. But we already knew that finite language families are learnable from positive-only examples.
In fact, as Horning (1969, 80-81) suggests, one need not rely on any conventional probability distribution over context-free grammars and languages; one could substitute, for example, a measure that ascribes exponentially lower probabilities to larger grammars, and search for target grammars in order of increasing length, or even a probability measure that just assigns the value 2–i to the grammar enumerated at step i, Gi (as discussed by Ken Wexler, 1980, fn. 22, 529-532; this is one way to develop the notion of an evaluation metric, something we’ll consider in later posts). More generally, any so-called uniformly computable probability measure works (as proved in Dan Osherson, Michael Stob, and Scott Weinstein’s book, 1986, chapter 10 pp. 187-188, and by Angluin 1988).
So, are we done? Does adding probabilities to context-free rules really make them easier to learn? No, there's a catch. TANSTAAFL (viz. There ain’t no such thing as a free lunch!). The measure-1 result holds scant cognitive interest (though it can be said to point the way to what constraints we really do need for human language learning models). Why? The positive results for measure-1 learnability only work if the learner has extremely strong prior knowledge about the probability distribution that's feeding it sentences. If you really want the learner’s tabula to be a little more rasa, then as the late Partha Niyogi put the matter, “when one considers statistical learning...all statistical estimation algorithms worth their salt are required to converge in a distribution-free sense” (Niyogi, 2005, p. 67). So that’s to say, suppose the learner doesn’t have any pre-conceived (and accurate) notion of how the example sentences will be distributed. Problem is, Niyogi then shows, as per Angluin (1988), that distribution-free statistical learning collapses back to Gold learning: if a family of languages is measure-1 learnable in a distribution-free sense, then it is Gold learnable. Measure-1 learnability doesn't enlarge the class of learnable languages one jot. Oops.
As if that weren’t enough, the assumptions blow away any remaining cognitive fidelity we thought we had won. For one thing, the statistical independence of PCFG rules is precisely what has been explicitly and strongly rejected by the same folks who embrace Horning, in the recent flood of parsers statistically trained in a supervised way from pre-parsed corpus data. Rather, such rules are plainly dependent: once you decide a sentence is NP VP, then it’s far more likely that the NP, being a subject, will be expanded as a pronoun, with the reverse true if the NP’s an object – which is just to say that PCFG rules aren’t really statistically independent. The assumption that sentences are pitched to the learner in i.i.d fashion is plainly wrong. Nobody believes parents speak this way.
Then too, Horning’s method for selecting grammars – explicit enumeration along with search using Bayes’ rule to hunt for the highest posterior probability grammar given the examples, sometimes collapsing phrase names together and sometimes splitting them apart – proved to be computationally intractable. (It’s also not a wit different conceptually from what’s been advanced by contemporary Bayesian enthusiasts as far as anyone can tell.) Horning himself remarks on how slow his “550 line Lisp program,” “EVALUATER”, ran, in part because “the enumerative problem is immense” (pp. 199-120). It could only (partly) find grammars with 2 or 3 nonterminals. To be sure, computation speed has increased by leaps and bounds over the past 35 years from Lisp/36 running on an IBM 360/67. And computer scientists have come up with faster methods to solve Bayesian problems, often approximate – ‘variational Bayes’ and the like. Nevertheless, the immense space of possible PCFGs – larger than the space of CFGs – confounds search methods to this day.
So in truth, Horning’s enumerative method doesn’t nearly go far enough. PCFGs don’t turn out to be any “easier to learn” than their non-probabilistic counterparts – perhaps what’s meant here is that it’s more straightforward to use known techniques to estimate PCFG probabilities. However, that’s not the same thing as learning the underlying rules themselves. In fact, estimating language distributions is harder in principle that estimating languages simpliciter, as Niyogi (2005, p.68) observes. What’s needed, rather, is far, far, stronger medicine, something along the lines of Ken Wexler's “bounded degree of error” approach, in which the possible mistakes a learner can detect with respect to sentences of at most 2 embeddings spans the entire space of all possible errors across a complete family of transformational grammars. Crucially, Ken’s approach is not enumerative. Instead, it carries out a search in a tightly-bounded, finite space, with each learning step incrementally improving a single transformational component, rather than selecting and throwing away wholesale entire grammars. But to get this to work, Wexler found out that one must impose far stronger a priori constraints on the family of grammars or languages.
In our internet era where it sometimes feels as though a veil of ignorance has been drawn across any work that was accomplished before 1990, it is well worth remembering that by that September 1970 workshop held at Stanford, Wexler and colleagues had already set up this model for learning transformational generative grammar where both presentation and convergence were probabilistic, in just the sense described by modern day probability enthusiasts – and found this was not nearly enough. One had to include constraints on grammars, constraints that wound up to be empirically attested, like locality constraints on how far one could displace phrases. That work, establishing a positive learnability result, still stands – though rarely cited by statistical enthusiasts despite its probabilistic underpinnings. In fact, it’s apparently even been recently revived just this past year, in work by Mark Steedman and colleagues (2012), though without any apparent tip of the hat in Ken’s direction. Mark uses combinatory categorial grammar instead of transformational grammar, but in many other ways follows Ken's model, with the learner using (meaning, surface string) pairs. Plus ça change.
Next: Why Revolutionary new ideas do appear infrequently.
A partial list of people embracing this view would include Jay Horning (1969); Henry Hamburger and Ken Wexler (1970, 1973, 1975); Bob Matthews (1979); Peter Culicover and Ken (1980); yours truly (1982/1985); Dana Angluin (1988); and Partha Niyogi (2005), among others. Here's Horning, 1969, 19, fn. 1, “In the sequel, when we prove identifiability in the limit with text presentation, it is with a different performance requirement, and a different condition on the presentation.”
Here is the quote from Wexler's presentation at a September, 1970 workshop held at Stanford p.159 “A language class is identifiable in the limit with probability 1 with respect to a probabilistic presentation scheme if there exists a learning procedure such that for any member of the class there is a subset of measure 1, of the set of presentation sequences for which the procedure identifies in the limit the language."
In fact, Horning, in what seems to have gone largely unnoticed by statistical enthusiasts stresses several times that his results hold only for the case of unambiguous context-free grammars: 1969: 32-33: “In the sequel we assume...that unambiguous grammars are desired, and will reject grammars which make any sample string ambiguous.” Of course, on the assumption that natural languages are ambiguous, this immediately makes his results unacceptable from the standpoint of cognitive fidelity. It’s hard to know just why this has been overlooked for 45 years. Putting aside the possibility that the people citing Horning in fact haven't really read his thesis, luckily, this omission is of little importance, because Horning simply didn’t know how to place a properly-defined probability measure on the languages generated by ambiguous context-free grammars, a matter that has since been remedied by Osherson, Stob, and Weinstein (1986), who established a far more general result encompassing Horning’s, as noted below.