Monday, August 12, 2013

Bayes Daze I


Ah, the dog daze of August.  Time to bask in the sun while sipping a margarita or two…but, alas, reality intrudes, in this case the recent discussion about Bayesian inference.  Well, as Jerry Fodor once wrote at the beginning of a wonderful rejoinder, “Do you want to know how to tell when you have gotten old? It’s when a cyclical theory of history starts to strike you as plausible. It begins to seem that the same stuff keeps coming around again, just like Hegel said. Except that it’s not ‘transcended and preserved’; it’s just back.”  So, I think I must have gotten old. Loyal blog readers may remember that in my very first post, “Going off the Gold standard” way back in November 2012, here I wrote, “from the very beginning of serious research into language learnability for natural languages, the probabilistic view has stood front and center.”  People quickly realized that Gold’s learnability conditions – exact identification of languages on all possible positive texts – seemed cognitively implausible and overly restrictive, thus ought to be relaxed in favor of, e.g., “probabilistic example presentation” and “probabilistic convergence” (Wexler, 1970). I emphasized that this was the consensus position, as of 43 years ago. Interestingly, the locus classicus of this research, then as well as arguably now, was at Stanford – not only was Ken Wexler working on probabilistic grammar learning (under Pat Suppes), but also Suppes himself was constructing probabilistic context-free grammars (PCFGs) to describe the child language transcripts that would later become CHILDES – e.g., Adam’s grammar for Noun Phrases, dipping his toe into Bayesian waters. Even more germane to Bayes Daze though, there was Jay Horning’s 1969 Stanford thesis, which established two main results, the first being perhaps the most cited from this era – but perhaps also the most misunderstood.[1] Further, it’s not clear that we’ve been able to improve on Horning’s results.  Let me explain. Here is what Horning did:

1.    PCFGs and Bayesian inference. To escape Gold’s demonstration that one could not learn from positive-only examples, Horning used PCFGs along with Bayes’ rule as a way to search for grammars. The ‘best’ grammar G is the one that maximizes the probability of a grammar G given a sequence of examples D, i.e., p(G|D), where D corresponds to a sequence of positive example sentences drawn from the target grammar.  Now as usual, via Bayes’ rule one can rewrite p(G|D) as the prior probability of G times the likelihood, divided by the probability of D: p(G)×p(D|G)/p(D). Also as usual, since D is fixed across all grammars, we can ignore the denominator and just maximize p(G)×p(D|G) to find the best G. (A minimal code sketch of this scoring scheme, and of the search problem it creates, appears just after this list.) The first term p(G), the prior probability of a grammar G, is presumed to be inversely proportional to the size of the grammar G, so smaller grammars are more probable and so ‘better’, while the second term, the likelihood, tells us how well that particular grammar fits the data.[2] In this setting, Horning proved that one can construct a learner that converges to the correct target grammar – in fact, he even wrote a 550-line LISP program, INFER, to do this. (Amusingly, he called this program “somewhat large,” p. 132.) As far as I can tell, nearly every later commentator has simply taken it for granted that Horning showed “that PCFGs could be induced without negative evidence” – a quote taken from the Jurafsky and Martin textbook, 2nd edition, p. 488. But this is not quite true…there are two caveats, which don’t ever seem to get the same airtime. First, Horning deliberately restricted his analysis to the case of unambiguous probabilistic context-free grammars. (“In the sequel we assume – except where specifically noted – that unambiguous grammars are desired, and will reject grammars which make any sample string ambiguous,” pp. 32-33, my emphasis.)  So, unless you think that natural language is unambiguous, Horning’s result cannot be drunk neat.  Now, this isn’t such a big deal.  The reason Horning deliberately restricted himself was technical: back in 1969 he didn’t know how to slice up probability mass properly when dealing with ambiguous PCFGs and their derivations so that the total probability would still add up to 1, as is necessary for a proper probability distribution. In the interim, we have learned how to do this; see, e.g., Chi, Z., 1999, “Statistical properties of probabilistic context-free grammars,” Computational Linguistics, 25:1, 131-160.  So, it should be straightforward to extend Horning’s proof to all PCFGs, and perhaps someone has already done this. (If so, I’m sure some reader can alert us to this.) Nonetheless, most authors (in fact all I know of, with one notable exception, and I don’t mean me) don’t even seem to be aware of what Horning actually proved, which I chalk up to yet another instance of the Internet Veil Effect: perhaps people have never bothered to actually read Horning’s thesis, and all the second- and third-hand reports don’t mention the bit about unambiguous grammars. Second, Horning tweaked the definition of convergence, so that it is weaker than Gold’s original ‘identification in the limit.’ Horning’s learner converges when the probability of guessing the wrong grammar gets vanishingly small – the chance that a non-target grammar is guessed decreases with the number of examples (see p. 80 in his thesis for the definition).  So, Horning’s learner can still conjecture non-target grammars, no matter how many examples it has seen. 
Recall that Gold requires that after some finite point only the target grammar (or a PCFG equivalent) can be guessed.  You might feel that Horning’s convergence criterion is psychologically more cogent – related to Wexler’s notion of “measure-1” learnability – and I think you’d be right.  So we’re probably on safe ground here as well.  Horning’s second result, however, isn’t so easily dodged.

2.   Computational tractability. Horning found that his inference method was computationally intractable, because the space of PCFGs is simply too large. “Although the enumerative Bayesean [sic] procedure presented in Chapter V and refined in later chapters is formally optimal, its Achilles’ heel is efficiency….the enumerative problem is immense” (pp. 151-152). He notes, e.g., that just the number of possible grammars with N nonterminals grows on the order of 2^(N^3), p. 157.  Here he could find no easy paths to salvation.
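
Just to make the shape of Horning’s procedure concrete, here is a minimal sketch, in Python rather than Horning’s LISP, of enumerative Bayesian grammar selection. Everything in it is invented for illustration (the toy grammars, their sizes, the sentence probabilities, and the exponential size penalty standing in for Horning’s prior); it is not Horning’s INFER program:

    import math

    def log_prior(grammar_size, beta=1.0):
        # Smaller grammars get higher prior probability; an (unnormalized)
        # exponential penalty on size stands in for Horning's prior here.
        return -beta * grammar_size

    def log_likelihood(grammar, data):
        # p(D|G): product over independently drawn sentences of their
        # probability under the candidate grammar (a toy lookup table here).
        total = 0.0
        for sentence in data:
            p = grammar["sentence_probs"].get(sentence, 0.0)
            if p == 0.0:
                return float("-inf")  # grammar fails to generate an observed sentence
            total += math.log(p)
        return total

    def best_grammar(candidates, data):
        # Enumerative Bayesian selection: argmax over G of log p(G) + log p(D|G).
        return max(candidates,
                   key=lambda g: log_prior(g["size"]) + log_likelihood(g, data))

    # Two hypothetical candidate grammars and a tiny corpus, all made up.
    G1 = {"name": "small grammar", "size": 4,
          "sentence_probs": {"the boy runs": 0.2, "the dog runs": 0.2}}
    G2 = {"name": "large grammar", "size": 12,
          "sentence_probs": {"the boy runs": 0.25, "the dog runs": 0.25}}
    corpus = ["the boy runs", "the dog runs", "the boy runs"]

    print(best_grammar([G1, G2], corpus)["name"])  # the smaller grammar wins here

The catch, of course, is that outer maximization: with N nonterminals the space of candidate grammars grows on the order of 2^(N^3), so brute-force enumeration of this kind is hopeless beyond toy cases, which is exactly Horning’s second point.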

Now, what progress have we made since 1969 with regard to these two basic results?

1.     PCFGs and Bayesian inference. Well, we’re still roughly in the same place. Aside from fixing the technical glitch with ambiguous PCFGs, the same framework is generally assumed, without change.  See, e.g., Perfors et al. (2011) for a representative example. The smaller the grammar, the larger its prior probability, announced as a kind of ‘Occam’s razor’ principle (simpler descriptions are better); the ‘fit’ of data to grammar is given as p(D|G) as before; and so forth. We’ll dive into this formulation more in the next post, including its (perhaps surprising) historical antecedents. But it’s easy to see that very little has really changed since Horning’s formulation.  One new wrinkle has been the addition of hierarchical Bayesian models, but we have to defer that discussion for now.

2.    Computational tractability. Same deal here. Current work acknowledges that searching the entire space of PCFGs is intractable, and in fact we now know much more. In general, Bayesian inference is provably intractable, i.e., NP-hard (Cooper, 1990; Kwisthout, 2009, 2011; Park & Darwiche, 2004; Shimony, 1994). This means it would require an unrealistic amount of time for anything but small hypothesis spaces. (If you read, e.g., the Perfors et al. paper you’ll see that they complain bitterly about this computational problem all the time, even dropping lots of sentences from their test corpus because it’s all just too complex.) Now, one response to this challenge has been to suggest that the Bayesian computations don’t need to deliver exact or optimal results. Rather, they only need to approximate exact or optimal solutions, by, say, sampling, or by using heuristics like a majority-vote-wins strategy (Chater et al., 2010; Sanborn, 2010).  At first blush, this all seems like a winning way out of the computational dilemma, but in fact it doesn’t help. Despite their appeal, “most–if not all–forms of approximating Bayesian inference are for unconstrained input domains as computationally intractable as computing Bayesian inference exactly” (Kwisthout and van Rooij, 2013, p. 8, my emphasis). So, “the assumption of ‘approximation’ by itself is insufficient to explain how Bayesian models scale to situations of real-world complexity” (ibid., p. 3, my emphasis).
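
To fix ideas about what ‘approximation by sampling’ means here, the following is a minimal sketch with a made-up three-grammar hypothesis space and invented prior and likelihood values; it is not the model of Chater et al. or Sanborn, just an illustration of estimating a posterior by Metropolis sampling rather than computing it exactly:

    import random

    hypotheses = ["G1", "G2", "G3"]
    prior      = {"G1": 0.5, "G2": 0.3, "G3": 0.2}     # made-up p(G)
    likelihood = {"G1": 1e-6, "G2": 5e-6, "G3": 2e-6}  # made-up p(D|G)

    def unnormalized_posterior(g):
        # p(G|D) up to the constant p(D), which the sampler never needs.
        return prior[g] * likelihood[g]

    def metropolis(n_samples=20000, seed=0):
        random.seed(seed)
        current = random.choice(hypotheses)
        counts = {g: 0 for g in hypotheses}
        for _ in range(n_samples):
            proposal = random.choice(hypotheses)       # symmetric proposal
            ratio = unnormalized_posterior(proposal) / unnormalized_posterior(current)
            if random.random() < min(1.0, ratio):
                current = proposal
            counts[current] += 1
        return {g: counts[g] / n_samples for g in hypotheses}

    print(metropolis())  # sample frequencies approximate the true posterior

The Kwisthout and van Rooij point is that nothing in this move buys tractability in general: getting such estimates to within a useful error bound can, in the worst case, be just as hard as the exact computation.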

What to do?  Well, this is precisely the same difficulty that was highlighted in the extensive computational complexity analysis of linguistic theories that my students and I carried out in the early 1980s. The take-home lesson from that work, as underscored in our 1987 book Computational Complexity and Natural Language, was not that complexity theory should be used as some kind of steroidal spitting contest to pick out the best linguistic theory, though the outside commentary sometimes reads as if this were its real aim.  If that’s all that readers got out of the book, then we failed miserably. We ought to refund you the dime in royalties we got for each copy. Rather, the goal was to use computational complexity theory as a diagnostic tool to pinpoint which parts of a theory were contributing to its intractability, and then to posit constraints and see if one could do better.  For instance, Eric Ristad found that ‘revised’ Generalized Phrase Structure Grammar (Gazdar, Pullum, Sag, Klein, 1985) was insanely complex, in the class of so-called exponential-polynomial time problems – more than exponential time, and so far beyond the class of NP-hard problems that it’s challenging to even conjure up a concrete example of such a beast; see here.  The source of the complexity?  Largely the feature system, which permits all kinds of crazy stuff: nonlocal agreement, unrestricted gaps, and so forth. So, Eric offered up a replacement that drove the complexity down to the class NP. These constraints were linguistically motivated: a strict version of X-bar theory, recoverability of deletion, and constraints on extraction domains.  You can consult Ristad (1989), linked above, or the 1987 book for details.

And that’s the model we’re after here. In just the same way, Kwisthout and van Rooij (2013) propose to apply techniques from parameterized complexity theory (Downey & Fellows, 1999) to the problem of (approximate) Bayesian inference.  What this comes down to is this: explicitly include structural properties of the input – usually obvious dependencies such as the number of parents for any given node, or the maximum path length between nodes – and see whether it is only these parts of the problem that contribute to the complexity blow-up, and, further, whether we can keep these troublesome parameters within bounded ranges – perhaps binary branching, or, as with recoverability of deletion, no nested empty nodes. Kwisthout and van Rooij, who just ran a tutorial on this very topic at this year’s Cognitive Science Society meeting, also have a published paper showing precisely how this can be done for certain Bayesian inference problems. The work here is just beginning. (See: “Bridging the gap between theory and practice of approximate Bayesian inference,” 2013, Cognitive Systems Research, 24, 2–8, or their CogSci tutorial slides here.)
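
Here is a back-of-the-envelope sketch of that mindset, using an invented toy network and the maximum number of parents per node as the structural parameter (Kwisthout and van Rooij work with parameters of this general kind; the particular parameter, cost model, and network below are mine, for illustration only):

    def max_in_degree(parents):
        # parents: dict mapping each node to the list of its parent nodes.
        return max((len(ps) for ps in parents.values()), default=0)

    def work_estimate(parents, states_per_node=2):
        # In many exact inference schemes the per-node work grows with the size
        # of the conditional probability table: exponential in the number of
        # parents k, but only linear in the number of nodes n. Keeping k bounded
        # (the analogue of binary branching in the grammar case) keeps the total
        # polynomial in n.
        n = len(parents)
        k = max_in_degree(parents)
        return n * (states_per_node ** (k + 1)), k

    # Toy Bayesian-network-style structure: a chain plus one high-in-degree node.
    toy_net = {
        "A": [],
        "B": ["A"],
        "C": ["B"],
        "D": ["A", "B", "C"],   # the troublesome node: three parents
    }

    estimate, k = work_estimate(toy_net)
    print(f"max in-degree k = {k}; rough work estimate = {estimate} table entries")

The parameterized-complexity question is then whether the blow-up can be pinned entirely on parameters like k, and whether the problems we actually face keep those parameters small.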

Now, just in case you thought that this Berwick guy has it in for Bayesian inference, you should know that my years slaving away in the biology electrophoretic gel mines have made me very pragmatic, so I’ll grab any tool that works. Norbert noted in his post that Bayesian approaches “seem to have a good way of dealing with what Mark [Johnson] calls chicken and egg problems.”  I agree.  In fact I’ve used them myself for just this sort of thing; e.g., in work that Sourabh Niyogi (not Partha) and I did, reported in a 2002 Cognitive Science Society paper, “Bayesian learning at the syntactic-semantic interface,” here, we showed how easily Bayesian methods can deal with the (to us pseudo) problem of ‘syntactic’ vs. ‘semantic’ bootstrapping by simply dumping both sorts of evidence into the Bayesian hopper without any such labels – because, after all, the kid just gets data; it doesn’t come pre-packaged one way or the other. 
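
To see how little machinery that ‘one hopper’ move needs, here is a minimal sketch. The two frame hypotheses, the observations, and every number in it are invented, and it is not the actual 2002 model; it only shows the key point, namely that each piece of evidence simply contributes a likelihood term, and nothing in the update ever asks whether that evidence was ‘syntactic’ or ‘semantic’:

    hypotheses = ["verb-takes-clause", "verb-takes-NP"]
    posterior = {h: 0.5 for h in hypotheses}   # flat, made-up prior

    # Each observation is just a table of p(observation | hypothesis). Some
    # happen to come from distributional (syntactic) cues, some from situational
    # (semantic) cues, but the update below never needs to know which is which.
    observations = [
        {"verb-takes-clause": 0.7, "verb-takes-NP": 0.2},  # e.g., hearing "... that S"
        {"verb-takes-clause": 0.6, "verb-takes-NP": 0.4},  # e.g., a scene suggesting a belief report
        {"verb-takes-clause": 0.8, "verb-takes-NP": 0.3},
    ]

    for obs in observations:
        # Bayes' rule: posterior(h) is proportional to prior(h) * p(obs | h).
        for h in hypotheses:
            posterior[h] *= obs[h]
        z = sum(posterior.values())
        posterior = {h: p / z for h, p in posterior.items()}

    print(posterior)   # mass shifts toward the clause-taking hypothesis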

That said, I should confess that I hold the same view on Bayes as CMU statistician Cosma Shalizi: “there are people out there who see Bayes’ Rule as the key to all methodologies, something essential to rationality. Personally, I find this view thoroughly misguided and not even a regulative ideal…” Now partly this is due to my intellectual heritage. I was trained by Art Dempster and John Tukey, both catholics. I can even recall the exact day in October 1973 when Art strolled into Statistics 126 and started scribbling the first version of the EM method on the chalkboard, then finishing it off with a bit on Bayesian hazards. But partly it’s also because I’ve always found ‘universal solutions’ to be just that – usually not.  As I’m sure my philosopher friends like Norbert would agree, if there really were a sure-fire solution to the problem of induction, by now everyone would have heard about it, and such big news would already have made it to the science pages of the NY Times or the Times’ The Stone (well, at least to the New York Review of Books). My MIT colleague, complexity theorist Scott Aaronson, also agrees. In his own blog ‘in response to a question from the floor’ he observes that there are lots of other, non-Bayesian, ways to tackle the learning problem, one being probably approximately correct (PAC-learning): “When you know only one formalism to describe some phenomenon (in this case, that of choosing hypotheses to fit data), it’s easy to talk yourself into believing that formalism is the Truth: to paraphrase Caliph Omar, ‘if it agrees with Bayesianism, it is superfluous; if it disagrees, it is heresy.’ The antidote is to learn other formalisms.”

So what about those other formalisms? Where does Bayesian inference fit in? That’s forthcoming in Bayes Daze II. There, among other things, we will have you play the Google Translation Game and do a bit of reverse engineering to learn how GT works. As a warm-up, you might try running the following French sentence through Google Translate to see what it spits out as English on the other end – and think a bit about why it behaves this way, which will tell you a lot about Bayesian inference: La pomme mange le garçon (‘the apple eats the boy’). Bayes Daze II will also tell you when the Bayesian approach to choosing grammars was first proposed. That answer too may surprise you. OK, back to those drinks…



[1] Jerry Feldman was also working with Horning at Stanford, and his 1969/1970 Stanford AI Lab report, Stanford AIM-93, “Some decidability results on grammatical inference,” contains another attempt to formulate a probabilistic inference procedure for probabilistic grammars. I don’t discuss Feldman’s work here because a number of years ago Partha Niyogi and I went through Feldman’s proof and found that it contained errors; we never took the time to go back and see whether it could be fixed up. See also: Feldman, Horning, Gips, Reder, “Grammatical complexity and inference,” Stanford AIM-89, 1969.
[2] More precisely, Horning used a ‘meta-grammar’ – a grammar that generates the various candidate PCFGs – roughly as in Perfors, Tenenbaum & Regier (2011) in Cognition, and along the lines of Fodor & Sakas’s work on triggers in Language Acquisition (2012).

7 comments:

  1. On the computational complexity front:
    Chihiro Shibata and Ryo Yoshinaka. PAC Learning of Some Subclasses of Context-Free Grammars with Basic Distributional Properties in ALT 2013 (so not published yet) show some computationally efficient and sample efficient (non-Bayesian) learning of CFGs. Not all CFGs of course, but classes that include all regular languages.

    And of course the Smith and Cohen paper we discussed last time establishes learnability of PCFGs from small samples if we neglect computational complexity. Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning, Shay B. Cohen and Noah A. Smith, Computational Linguistics (2012)

    And if you don't care about sample complexity or computational complexity then Angluin's technical report from 1988 "Identifying languages from stochastic examples" shows that a huge class can be learned.

    Replies
    1. These are all worthy examples from the Formal Learning Theory tradition but... Bayes Daze was meant to focus on, well, Bayesian approaches to learning. We'll take up Formal Language Theory in later posts. But, as an appe-teaser, it may be of some interest to note that recent work cited in the comment, or recent reviews of the same, is seemingly unaware that in many cases it is rehearsing the findings of Formal Learning Theory from long ago. The various notions of 'prudence' (introduced by Scott Weinstein and Dan Osherson in 1982), exact learning, information efficiency, probabilistic data, and computational complexity were all thoroughly examined in the Formal Learning Theory tradition, as described in the Osherson, Stob, & Weinstein book, "Systems That Learn" (1987), or its 2nd edition. One difference is that this older tradition used recursive function theory applied to learning, derived from Manuel Blum's (1975) seminal paper, rather than modern-day complexity classes. But qualitatively it's the same. As Dan Osherson (p.c.) tells me, there don't seem to be any spanking-new, crisp results here. Now, if I were a betting man, when it comes to learning theory I'd wager that Dan and Scott know what they are talking about. Interestingly enough, Scott and Dan have also found that there *is* a connection between Formal Learning Theory and statistical inference, though it may not be quite what you would have desired. So much for the teaser. And in truth, if any of this really *was* a solid solution to the problem of induction, do you think we'd all be sitting around typing into funny blogger boxes? I'd be on Wall Street, cleaning up, or, better yet, I'd have already been on Wall Street and cleaned up, and now retired to a sunny island sipping those drinks….

  2. Boy you guys are a tough crowd -- I give you a paper so new it hasn't been published yet, and you reject it without even reading it because a friend of yours says it was all done before in the 80s!

    I am a proud owner of the Systems that learn book, and a paper copy of Horning's thesis for that matter, and they are both good reads, but the STL book does have some gaps -- and probabilistic learning and complexity theory are the two main ones. So if you think that computational complexity is a big concern, and you think that probabilistic reasoning is important, then the STL book is *not* the right place to start -- IIRC they even removed all of the probabilistic stuff (all 3 pages of it in the first edition) from the second edition.

    (Lots of machine learning people do leave academia and go to Wall Street (e.g., all of the IBM speech recognition guys to Jim Simons' hedge fund...), and I guess by now they all have yachts and/or private islands, but unfortunately CFGs and MCFGs aren't widely used in finance, so my chances of a nice consultancy gig are low....)

    Replies
    1. Hey Alex, your parenthetical facts there aren't quite right. Two of the major people working on IBM's machine translation system (not speech recognition), Peter F. Brown and Bob Mercer, left IBM to join Renaissance. If anyone else from that (large) team now works in finance, I don't know it; the one speech recognition person involved was Fred Jelinek, who is (tellingly enough) not a co-author on their 1993 paper (the one with the details).

  3. And, if Bayes is so old-hat and done business, why did people such as Crain and Thornton in 2002 (and maybe even more recently) complain about Wexler and Culicover's Uniqueness Principle without that it has a softer and more plausible statistical version in which you try to improve your grammar by increasing the extent to which it can predict what people will say under various circumstances, including their intended meaning, which can often be guessed from context and partial knowledge of the language?

  4. without noting that <- without that

  5. Like Avery I think the claim that probabilistic learning was widely accepted in models of language acquisition in generative grammar is a little questionable; Charles Yang was complaining in the last thread about getting stick for probabilistic learning (in 2002!)

    I'd be particularly interested to hear your take on the Subset Principle (Wexler, Manzini, Berwick...) because to me, it only gets its force from a non-probabilistic learning paradigm.

    (BTW, if you only want Bayesian models of learning, then maybe Chater and Vitanyi's paper in the Journal of Mathematical Psychology is essentially a very big extension of Horning's work, though not computationally bounded.)
