Monday, November 19, 2012

Grammatical Zombies

The talk on the street is that linguists should stuff their portfolios with probabilistic grammars as they have intrinsic advantages over the old, stogy under-performing non-probabilsitic grammars of yore.  Partisans assure us that going probabilistic makes the learning problem easier, reduces the need for rich innate structures, and even lowers cholestoral levels. Though I am genetically inclined to skepticism (especially about things that sound too good to be true), debunking these claims has always required talents above my pay grade.  Lucky for me I know people who understand these issues deeply and are able to translate their significance for the unwashed, i.e. me. This is all preamble for the following post by Bob Berwick, someone that I have been able to persuade to post on an occasional basis, (hopefully not too occasionally) especially on technical topics from computational linguistics. The bottom line is that ideas that are too good to be true aren't, though their intellectual half lives can be very very long, as Bob explains below. Btw, I stole the title for this post from Bob's apposite reference to John Quiggin, an Australian economist who has done yeoman's work debunking hard to kill bad ideas in macroeconomics.

                                                             Going off the Gold Standard
                                                                     Robert C. Berwick

You don't have to wander very far along the computational linguistics byways these days to stumble across what economist John Quiggin dubs “zombie ideas” - bad memes like “supply-side economics after the Clinton boom and the Bush bust – repeatedly refuted with evidence and analysis" yet somehow always rising from the dead.  Among these, two stand out, virtually articles of probabilistic faith: One, that Gold’s (1967) celebrated results about the non-learnability of most language families given positive-only evidence become moot, once one “goes probabilistic.”  And two, that “going probabilistic” can be cashed out simply by adding probabilities to context-free rules, rendering probabilistic context-free grammars easier to learn than their context-free counterparts.  Easy enough, in fact, to drive a stake through the heart of Demon Rationalism (aka, prior constraints on grammars).  But is any of this true?  Or is it instead, to quote the now-immortal words of Meygn Kelly on election night, just “math you do as a Republican to make yourself feel better”?
Well, Yes and No.  Yes, in that nearly every thoughtful scholar of language learnability, starting just a few years post Gold-rush, bound up in the foundational work of Jay Horning, Jerry Feldman, and Ken Wexler, in the late 1960s, and up to the present day, has stressed that the tough Gold standards for learning exact identification of languages on all possible positive texts – are cognitively implausible and overly restrictive, so should be relaxed in favor of what's called “probabilistic example presentation” and “probabilistic convergence.”  That was, and still is, the consensus view.  In other words, from the very beginning of serious research into language learnability for natural languages, the probabilistic view has stood front and center.  All agree we’re after a learning standard that ensures much more: feasibility. (Chomsky's words, 1965, pp. 53-54) or what Ken Wexler (1980) termed “easy learnability”, that is, “learnability from fairly restricted primary data, in a sufficiently quick time, with limited use of memory” (p. 18).[1] [2]
In probabilistic presentation, a learner gets sentences drawn from some underlying distribution, and only has to succeed on positive example sequence that it gets with sufficient probability, rather than on all positive example sequences, as with Gold.  And in fact, there’s a technical sense in which this move does the trick.  When learnability on all texts – the Gold standard – is relaxed in favor of what Ken Wexler was first to call “measure-1 learnability” at a Stanford workshop in September, 1970, then, unsurprisingly, moving the goalposts changes the game.  What perhaps is surprising is this change seemingly moves the goalposts almost to the line of scrimmage: now, instead of no interesting language classes being learnable, all interesting language classes are learnable (officially, all recursively enumerable language families).

Why?  Picture sentences being spit out one at a time by some underlying stochastic source (aka, the “I blame the parents” model), so every sentence occurs with some probability. (Assuming independent draws – what’s technically called “i.i.d” – draws from an independent and identically distributed stochastic source).  In that case, a learner cannot count on using some vanishingly rare example sentence as a red flag for their correct target language. So, we don’t want to insist, as Gold did, that the learner must succeed on all texts.  If figuring out that your language is English means having to wait for the one David Foster Wallace sentence that begins “Uninitiated adults who might...” and trundles to a close 90 words later with, “subdued, almost narcotized-looking,” then you know you’re goose is cooked.  However, if a learner knows that all longer and longer sentences always become less and less likely, then it can know with high confidence that e.g., the first 100 sentences are more than enough to sort out what particular target language your source has been pitching.

And this is exactly what Horning (1969) did.  He attached probabilities to context-free grammar rules, to get probabilistic context-free grammars (PCFGs).  So for instance, if there are 2 rules in a grammar expanding NPs, NP → Det N and NP → Pronoun, then we have a probability of 0.5 for each.  Since every rule in a PCFG is assumed to be statistically independent, the probability of any particular sentence is simply the product of the probabilities of the rules deriving it.  As sentences get longer and longer, their probabilities must shrink – they become exponentially less likely. If the learner knows this in advance, then it knows that very long sentences are so unlikely that they can be effectively eliminated, with the problem reduced to essentially that of learning to identify a family of finite languages. But we already knew that finite language families are learnable from positive-only examples.[3]
In fact, as Horning (1969, 80-81) suggests, one need not rely on any conventional probability distribution over context-free grammars and languages; one could substitute, for example, a measure that ascribes exponentially lower probabilities to larger grammars, and search for target grammars in order of increasing length, or even a probability measure that just assigns the value 2i to the grammar enumerated at step i, Gi (as discussed by Ken Wexler, 1980, fn. 22, 529-532; this is one way to develop the notion of an evaluation metric, something we’ll consider in later posts).  More generally, any so-called uniformly computable probability measure works (as proved in Dan Osherson, Michael Stob, and Scott Weinstein’s book, 1986, chapter 10 pp. 187-188, and by Angluin 1988).
So, are we done? Does adding probabilities to context-free rules really make them easier to learn?  No, there's a catch. TANSTAAFL (viz. There ain’t no such thing as a free lunch!). The measure-1 result holds scant cognitive interest (though it can be said to point the way to what constraints we really do need for human language learning models).  Why? The positive results for measure-1 learnability only work if the learner has extremely strong prior knowledge about the probability distribution that's feeding it sentences.  If you really want the learner’s tabula to be a little more rasa, then as the late Partha Niyogi put the matter, “when one considers statistical learning...all statistical estimation algorithms worth their salt are required to converge in a distribution-free sense” (Niyogi, 2005, p. 67).  So that’s to say, suppose the learner doesn’t have any pre-conceived (and accurate) notion of how the example sentences will be distributed. Problem is, Niyogi then shows, as per Angluin (1988), that distribution-free statistical learning collapses back to Gold learning: if a family of languages is measure-1 learnable in a distribution-free sense, then it is Gold learnable.  Measure-1 learnability doesn't enlarge the class of learnable languages one jot. Oops.
As if that weren’t enough, the assumptions blow away any remaining cognitive fidelity we thought we had won.  For one thing, the statistical independence of PCFG rules is precisely what has been explicitly and strongly rejected by the same folks who embrace Horning, in the recent flood of parsers statistically trained in a supervised way from pre-parsed corpus data.  Rather, such rules are plainly dependent: once you decide a sentence is NP VP, then it’s far more likely that the NP, being a subject, will be expanded as a pronoun, with the reverse true if the NP’s an object – which is just to say that PCFG rules aren’t really statistically independent.  The assumption that sentences are pitched to the learner in i.i.d fashion is plainly wrong.  Nobody believes parents speak this way.
Then too, Horning’s method for selecting grammars – explicit enumeration along with search using Bayes’ rule to hunt for the highest posterior probability grammar given the examples, sometimes collapsing phrase names together and sometimes splitting them apart – proved to be computationally intractable.  (It’s also not a wit different conceptually from what’s been advanced by contemporary Bayesian enthusiasts as far as anyone can tell.) Horning himself remarks on how slow his “550 line Lisp program,”  “EVALUATER”, ran, in part because “the enumerative problem is immense” (pp. 199-120). It could only (partly) find grammars with 2 or 3 nonterminals. To be sure, computation speed has increased by leaps and bounds over the past 35 years from Lisp/36 running on an IBM 360/67. And computer scientists have come up with faster methods to solve Bayesian problems, often approximate – ‘variational Bayes’ and the like.  Nevertheless, the immense space of possible PCFGs – larger than the space of CFGs – confounds search methods to this day.
So in truth, Horning’s enumerative method doesn’t nearly go far enough. PCFGs don’t turn out to be any “easier to learn” than their non-probabilistic counterparts –  perhaps what’s meant here is that it’s  more straightforward to use known techniques to estimate PCFG probabilities.  However, that’s not the same thing as learning the underlying rules themselves.  In fact, estimating language distributions is harder in principle that estimating languages simpliciter, as Niyogi (2005, p.68) observes.  What’s needed, rather, is far, far, stronger medicine, something along the lines of Ken Wexler's “bounded degree of error” approach, in which the possible mistakes a learner can detect with respect to sentences of at most 2 embeddings spans the entire space of all possible errors across a complete family of transformational grammars.  Crucially, Ken’s approach is not enumerative.  Instead, it carries out a search in a tightly-bounded, finite space, with each learning step incrementally improving a single transformational component, rather than selecting and throwing away wholesale entire grammars. But to get this to work, Wexler found out that one must impose far stronger a priori constraints on the family of grammars or languages. 
In our internet era where it sometimes feels as though a veil of ignorance has been drawn across any work that was accomplished before 1990, it is well worth remembering that by that September 1970 workshop held at Stanford, Wexler and colleagues had already set up this model for learning transformational generative grammar where both presentation and convergence were probabilistic, in just the sense described by modern day probability enthusiasts – and found this was not nearly enough.  One had to include constraints on grammars, constraints that wound up to be empirically attested, like locality constraints on how far one could displace phrases.  That work, establishing a positive learnability result, still stands – though rarely cited by statistical enthusiasts despite its probabilistic underpinnings.  In fact, it’s apparently even been recently revived just this past year, in work by Mark Steedman and colleagues (2012), though without any apparent tip of the hat in Ken’s direction.  Mark uses combinatory categorial grammar instead of transformational grammar, but in many other ways follows Ken's model, with the learner using (meaning, surface string) pairs.  Plus ça change.

Next: Why Revolutionary new ideas do appear infrequently.

[1]A partial list of people embracing this view would include Jay Horning (1969); Henry Hamburger and Ken Wexler (1970, 1973, 1975); Bob Matthews (1979); Peter Culicover and Ken (1980); yours truly (1982/1985); Dana Angluin (1988); and Partha Niyogi (2005), among others. Here's Horning, 1969, 19, fn. 1, “In the sequel, when we prove identifiability in the limit with text presentation, it is with a different performance requirement, and a different condition on the presentation.”
[2]Here is the quote from Wexler's presentation at a September, 1970 workshop held at Stanford p.159 “A language class is identifiable in the limit with probability 1 with respect to a probabilistic presentation scheme if there exists a learning procedure such that for any member of the class there is a subset of measure 1, of the set of presentation sequences for which the procedure identifies in the limit the language."

[3]In fact, Horning, in what seems to have gone largely unnoticed by statistical enthusiasts stresses several times that his results hold only for the case of unambiguous context-free grammars: 1969: 32-33: “In the sequel we assume...that unambiguous grammars are desired, and will reject grammars which make any sample string ambiguous.”  Of course, on the assumption that natural languages are ambiguous, this immediately makes his results unacceptable from the standpoint of cognitive fidelity.  It’s hard to know just why this has been overlooked for 45 years.  Putting aside the possibility that the people citing Horning in fact haven't really read his thesis, luckily, this omission is of little importance, because Horning simply didn’t know how to place a properly-defined probability measure on the languages generated by ambiguous context-free grammars, a matter that has since been remedied by Osherson, Stob, and Weinstein (1986), who established a far more general result encompassing Horning’s, as noted below.


  1. Hi Bob/Norbert,

    Interesting post on a topic dear to my heart.

    Could you expand on why you think distribution free learning is an appropriate model in this case? And what you mean by it, because you seem to use it in two completely different senses.

    I know Partha claimed this but it seems so obviously wrong that I think I must be misunderstanding his claim, given that he knew his beans. Though he didn't actually provide an argument for his position.

    And what has this got to do with learning PCFGs? where the distributions are generated by PCFGs (duh!) and which is therefore not distribution free.

  2. Hi Alex, hat's a fair point. Partha's comment about distribution-free learning is meant to indicate that if one goes entirely tabula rasa -- so that the learner knows nothing whatsoever about what the distribution of sentences might be -- then measure-1 learnability collapses to Gold learnability. If one is willing to ascribe prior knowledge to learners by the way of assumptions regarding distributions, then one can purchase more success. Among others, Angluin doesn't seem to find this too opprobrious. However, as the blog said, the measure-1 learnability result (and Horning's much narrower version of it), while technically justifiable, results in models that are so unrealistic that they are only of mathematical, not empirical, interest. And that's Dan Osherson's view. Presentation via stochastically chosen members of a language, uniformly computable AND with sentences selected in i.i.d fashion, doesn't accord with what we know about the empirical facts regarding input to children. Like I said, nobody talks this way. So that includes as a subcase PCFGs. Now, if there is a learnability result for PCFGs that works while sticking more closely to empirically justified rather than mathematically justified assumptions, then that would be a real advance -- but that's not what's usually alluded to when people mention Horning and PCFGs in the same breath. For what it's worth, my own take looking backwards at 40+ years of Gold and formal learnability results is: where's the beef? That is, what has this approach really told us about the nature of human language and children? And there, despite some salutary effects for cognitive science generally about how to think a bit more clearly, for me the bottom-line is: not much. But others may well have a different view.

  3. "Presentation via stochastically chosen members of a language, uniformly computable AND with sentences selected in i.i.d fashion, doesn't accord with what we know about the empirical facts regarding input to children."

    So certainly the iid assumption is false -- but you can replace it if you like with something weaker, if you like (ergodic and rapidly mixing). I don't think that's a big deal -- it's an idealisation of a classic type.

    But the computable distribution thing seems reasonable. I mean, the distributions don't come out of thin air but represent the linguistic behaviour of adult speakers. So assuming that humans are computable (which again is fairly standard) aren't the distributions going to be computable? And even if you have some noncomputable randomness in the environment, Angluin's results are a bit more general than just uniformly computable distributions.
    In any event we can look at the distributions of CDS in Childes -- and see for example, that the distribution of lengths does tail off exponentially.

    So this is where I think it radically differs from the normal situation -- CDS is generated by native speakers, not by the rustling of leafs or a random number generator. So the distribution is going to have some properties that depend on the I-language: obviously PCFGs are way too simple as a model here, but some model will presumably be ok, namely the one where we view the whole brain as a probabilistic Turing machine.

    Maybe I am missing the point -- but which 'empirical facts' are you talking about? Other than iid.

    I kind of feel there are two different arguments here -- one is about whether PCFGs can be learned, and the second is whether PCFGs are a good model of the distribution of sentences.

    Angluin shows that the answer to the first is yes even if the learner has the weakest possible prior knowledge (apart from the HUGE issue of computational complexity). The second one is clearly no, since CFGs aren't an adequate model of syntax.

    I agree that the measure 1 learnability is irrelevant, for pretty much the same reasons that the Gold model is. I.e. the class of distributions is way too big just like the class of presentations in the Gold model is way too big.

    But let me ask one question -- you say at the beginning that you favour probabilistic models of learning, and the you reject the Gold models, but you reject the iid assumption and you reject limits on the classes of distributions and so on. So what model do you favour? We are going to have to idealise in some way to make progress -- so being positive what sorts of probabilistic models do you favour?

    1. Being constructive, rather than banging on about the indubitable limitations of Horning's paper, there is a recent paper in Computational Linguistics "Empirical Risk Minimization for
      Probabilistic Grammars: Sample
      Complexity and Hardness of Learning" by Noah Smith and Shay Cohen, that discusses the learnability issues of PCFGs in detail. Do you accept the conclusions of that paper? Do you think that learning model is reasonable?

  4. By the way, I completely agree with your comments about computational complexity -- I think this is the really crucial problem; completely neglected by Bayesians in general.

  5. Yes but the Bayesians do seem to me to have shown that there is no real need for certain kinds of principles and parameters that don't seem to have worked out very well. For example my 'Morphological Blocking Constraint' from 1990, which is supposed to explain why in Irish, 'analytic' forms consisting of a non-person-marked verb followed by a pronoun don't usually don't freely alternate with 'synthetic' ones (with a person-marked form) when both are available (for reasons discussed in the article, movement rules don't do a very good job with this data).

    But the principle has various problems, some of the synthetic forms do alternate, some of them are 'less common' etc, &, for a Bayesian learner, the best analysis would be the one that correctly mirrored the statistics of the community, even if it required some stipulative constraints in the morphology to supress synthetic forms in certain cases.

    Another example would be that that-trace constraint, which was supposed to need some clever principles to cause it to exist without negative data, but none of the attempts to make it follow from anything seem to have worked out (Ash Asudeh's 2009 LFG conference paper).

    1. I wasn't clear -- I am a fan of probabilistic reasoning in general and therefore of Bayesian models, I just think they neglect the computational issues. There is a lot of rather murky stuff going on in the details of the sampling algorithms they use to do inference in these models. In some cases this isn't a problem (Morphology, POS models and so on) -- in other cases, like PCFG induction there is a big problem.

    2. As a descriptive syntactician, I naturally think that if you got the grammatical prior aka notation right, the computations to find the right grammar might be more doable (but the correct ones might still be too hard for our current technology). The Bayesian notion of 'fit' of data to grammar is a major upgrade over Chomsky's, at least in the minds of the average working syntactician, to whatever extent it may have been present in earlier computational work.]

      I do not understand why people are still so fixated on PCFGs as anything other than an elementary example, since they are hopelessly lame in so many obviously important ways.

    3. The richer formalisms, like MCFGs/LCFRSs/MGs and so are just a lot harder to work with on a practical level. The parsing algorithms are more complex, there are issues with training models. I have seen a few recent papers though using formalisms that fix the linguistic problems of PCFGs -- a paper by Laura Kallmeyer on Data-Driven parsing with LCFRSs (on German) and a paper by Jenny Finkel on Efficient, Feature-based, Conditional Random Field Parsing,
      which uses a feature based representation. But there are technical problems which are hard to overcome.

      From a theoretical point of view, all of the probabilistic variants of these formalisms are learnable in the sense that PCFGs are learnable above --- i.e. if you neglect the computational bounds.
      I think there have been *huge* technical advances in our understanding of mildly context sensitive formalisms in the last 5 years. This will feed into more interesting models in the near future -- viz Ed Stabler's work on parsing with MGs.