Ah,
the dog daze of August. Time to bask in
the sun while sipping a margarita or two…but, alas, reality intrudes, in this
case the recent discussion about Bayesian inference. Well, as Jerry Fodor once wrote at the
beginning of a wonderful rejoinder, “Do you want to know how to tell when you
have gotten old? It’s when a cyclical theory of history starts to strike you as
plausible. It begins to seem that the same stuff keeps coming around again,
just like Hegel said. Except that it’s not ‘transcended and preserved’; it’s
just back.” So, I think I must have
gotten old. Loyal blog readers may remember that in my very first post, “Going
off the Gold standard” way back in November 2012, here I wrote, “from the very beginning of serious research into
language learnability for natural languages, the probabilistic view has stood front
and center.” People quickly realized
that Gold’s learnability conditions – exact
identification of languages on all possible
positive texts – seemed cognitively implausible and overly restrictive, thus ought
to be relaxed in favor of, e.g., “probabilistic example presentation” and
“probabilistic convergence” (Wexler, 1970). I emphasized that this was the consensus position, as of 43 years ago. Interestingly,
the locus classicus of this research,
then as well as arguably now, was at Stanford – not only was Ken Wexler working
on probabilistic grammar learning (under Pat Suppes), but also Suppes himself
was constructing probabilistic context-free grammars (PCFGs) to describe the
child language transcripts that would later become CHILDES – e.g., Adam’s
grammar for Noun Phrases, dipping his toe into Bayesian waters. Even more germane
to Bayes Daze though, there was Jay Horning’s 1969 Stanford thesis, which
established two main results, the first being perhaps the most cited from this
era – but perhaps also the most misunderstood.[1]
Further, it’s not clear that we’ve been able to improve on Horning’s results. Let me explain. Here is what Horning did:
1. PCFGs
and Bayesian inference. To
escape Gold’s demonstration that one could not learn from positive-only
examples, Horning used PCFGs along with Bayes’ rule as a way to search for
grammars. The ‘best’ grammar G is the
one that maximizes the probability of a grammar G given a sequence of examples D,
i.e., p(G|D), where D corresponds to a sequence of positive
example sentences drawn from the target grammar. Now as usual, via Bayes’ rule one can rewrite p(G|D) as the prior probability of G times the likelihood, divided by the
probability of D: p(G)×p(D|G)/p(D). Also as usual, since p(D) is the same for every grammar, we can
ignore it and just maximize p(G)×p(D|G) to find the best G. The first term p(G),
the prior probability of a grammar G,
is presumed to be inversely proportional to the size of the grammar G, so smaller grammars are more likely
and so ‘better’, while the second term, the likelihood, tells us how well that
particular grammar fits the data.[2]
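To make the prior-versus-likelihood trade-off concrete, here is a purely schematic sketch (mine, not Horning's INFER, and in Python rather than LISP). The grammar 'sizes' and per-sentence probabilities below are made-up stand-ins for what a real PCFG and parser would supply; the only point is the scoring rule itself: pick the G that maximizes log p(G) + log p(D|G), with a prior that penalizes larger grammars.

import math

def log_prior(grammar_size, alpha=1.0):
    # Description-length-style prior: log p(G) falls off with grammar size.
    return -alpha * grammar_size

def log_likelihood(data, sentence_prob):
    # log p(D|G), assuming the example sentences are drawn independently.
    return sum(math.log(sentence_prob(s)) for s in data)

# Two toy 'grammars': (name, size in symbols, stipulated probability per sentence).
# G_small is compact but fits each sentence loosely; G_big is bulky but fits tightly.
G_small = ("G_small", 8, lambda s: 0.05)
G_big = ("G_big", 40, lambda s: 0.20)

def map_score(grammar, data):
    name, size, sentence_prob = grammar
    return log_prior(size) + log_likelihood(data, sentence_prob)

data = ["the dog barks", "the cat sleeps", "the dog sleeps"]
for g in (G_small, G_big):
    print(g[0], round(map_score(g, data), 2))
print("MAP choice:", max((G_small, G_big), key=lambda g: map_score(g, data))[0])

With only three example sentences the compact grammar wins; feed the learner a few dozen more and the better-fitting grammar overtakes it. That tension between prior and likelihood is exactly what the rest of this post worries about.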
In this setting, Horning proved that one can come up with a learner that
converges to the correct target grammar to be learned – in fact, he even wrote
a 550 line LISP program INFER to do this. (Amusingly, he called this program
“somewhat large,” p. 132.) As far as I can tell, nearly every later commentator
has simply taken it for granted that Horning showed “that PCFGs could be induced
without negative evidence” – a quote taken from the Jurafsky and Martin
textbook, 2nd edition, p. 488. But this is not quite true…there are two caveats, which don’t ever seem to get the
same airtime. First, Horning deliberately restricted his analysis to the case
of unambiguous probabilistic
context-free grammars. (“In the sequel we assume – except where
specifically noted – that unambiguous
grammars are desired, and will reject grammars which make any sample string
ambiguous” pp. 32-33, my emphasis.) So,
unless you think that natural language is unambiguous, Horning’s result cannot
be drunk neat. Now, this isn’t such a
big deal. The reason Horning deliberately
restricted himself was technical. Back in 1969 he didn’t know how to slice up
probability mass properly when dealing with ambiguous
PCFGs and their derivations so that the total probability would still add up to
1, as is necessary for a proper probability distribution. In the interim, we
have learned how to do this; see, e.g., Chi, Z., 1999, “Statistical properties
of probabilistic context-free grammars,” Computational Linguistics,
25:1, 131-160. So, it should be
straightforward to extend Horning’s proof to all PCFGs, and perhaps someone has
already done this. (If so, I’m sure some reader can alert us to this.) Nonetheless,
most authors (in fact all I know of with one notable exception, and I don’t
mean me) don’t even seem to be aware of what Horning actually proved, which I chalk up to yet another instance of the
Internet Veil Effect, i.e., perhaps people have never bothered to actually read Horning’s thesis, and all the
second- and third-hand reports don’t mention the bit about unambiguous
grammars. Second, Horning tweaked the definition of convergence, so that it is
weaker than Gold’s original ‘identification in the limit.’ Horning’s learner converges
when the probability of guessing the wrong grammar gets vanishingly small – the
chance that a non-target grammar is guessed decreases with the number of
examples (see p. 80 of his thesis for the definition). So, Horning’s learner can still conjecture
non-target grammars, no matter how many examples it has seen. Recall that Gold
requires that after some finite point only
the target grammar (or PCFG equivalents) can be guessed. You might feel that Horning’s convergence test
is psychologically more cogent – related to Wexler’s notion of “measure-1”
learnability – and I think you’d be right.
So we’re probably on safe ground here also. Horning’s second bullet, however, isn’t so
easily dodged.
2. Computational
tractability. Horning
found that his inference method was computationally intractable, because the
space of PCFGs is simply too large. “Although the enumerative Bayesean [sic]
procedure presented in Chapter V and refined in later chapters is formally
optimal, its Achilles’ heel is efficiency….the enumerative problem is immense” (pp. 151-152).
He notes, e.g., that just the number of possible grammars with N nonterminals grows as 2 to the power N³ (p. 157). Here he could find no easy paths to salvation.
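Just to see how hopeless brute-force enumeration is, here is a two-line back-of-the-envelope check (taking Horning's 2-to-the-N³ figure at face value; the exact constant doesn't matter):

# Number of candidate grammars, if it grows as 2 ** (N ** 3) for N nonterminals.
for N in range(1, 6):
    print(N, 2 ** (N ** 3))

By N = 4 you are already past 10 to the 19th candidate grammars, and by N = 5 past 10 to the 37th; no enumerative procedure is going to wade through that.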
Now, what progress have we made since
1969 with regard to these two basic results?
1. PCFGs
and Bayesian inference. Well,
we’re still roughly in the same place. Aside from fixing the technical glitch with
ambiguous PCFGs, the same framework is generally assumed, without change. See, e.g., Perfors et al. (2011) for a representative example. The smaller the
grammar, the larger its prior probability, announced as a kind of ‘Occam’s
razor’ principle (simpler descriptions are better); the ‘fit’ of data to
grammar is given as p(D|G)
as before; and so forth. We’ll dive into this formulation more in the next post,
including its (perhaps surprising) historical antecedents. But it’s easy to see
that very little has really changed since Horning’s formulation. One new wrinkle has been the addition of hierarchical Bayesian models, but we
have to defer that discussion for now.
2. Computational
tractability. Same deal
here. Current work acknowledges that searching the entire space of PCFGs is
intractable, and in fact we now know much more. In general, Bayesian inference
is provably intractable (NP-hard; Cooper, 1990; Kwisthout, 2009, 2011; Park & Darwiche, 2004; Shimony, 1994).
This means it would require an unrealistic amount of time for anything but
small hypothesis spaces. (If you read, e.g., the Perfors et al. paper you’ll see that they complain bitterly about this
computational problem all the time, even dropping lots of sentences from their test
corpus because it’s all just too complex.) Now, one response to this challenge
has been to suggest that the Bayesian computations don’t need to get exact or optimal results. Rather, they only need to approximate exact or optimal solutions, by, say, sampling, or by
using heuristics like a majority-vote-wins strategy (Chater et al. 2010;
Sanborn, 2010). At first blush, this all
seems like a winning way out of the computational dilemma, but in fact, it doesn’t help. Despite their appeal, “most–if not all–forms of approximating
Bayesian inference are for unconstrained input domains as computationally
intractable as computing Bayesian inference exactly” Kwisthout and van Rooij
(2013, 8, my emphasis). So, “the assumption of ‘approximation’ by itself is insufficient to explain how
Bayesian models scale to situations of real-world complexity” (Ibid., 3, my emphasis).
What
to do? Well, this is precisely the same difficulty
that was highlighted in the extensive computational complexity analysis of
linguistic theories that my students and I carried out in the early 1980s. The
take-home lesson from that work, as underscored in our book Computational Complexity and Natural
Language (1987), was not that
complexity theory should be used as some kind of steroidal spitting contest to
pick out the best linguistic theory, though the outside commentary sometimes reads
as if this were its real aim. If that’s
all that readers got out of the book, then we failed miserably. We ought to
refund you the dime in royalties we got for each copy. Rather, the goal was to
use computational complexity theory as a diagnostic
tool to pinpoint what part of a theory was contributing to its intractability,
to be followed by positing constraints to see if one could do better. For instance, Eric Ristad found that
‘revised’ Generalized Phrase Structure Grammar (Gazdar, Pullum, Sag, Klein,
1985) was insanely complex, in the class of so-called exponential-polynomial
time problems – more than exponential time, and so far beyond the class of
NP-hard problems that it’s challenging to even conjure up a concrete example of
such a beast; see here. The source of the complexity? Largely the feature system, which permits all
kinds of crazy stuff: nonlocal agreement, unrestricted gaps, and so forth. So, Eric
offered up a replacement that drove the complexity down to the class NP. These
constraints were linguistically motivated: a strict version of X-bar theory;
recoverability of deletion; and constraints on extraction domains. You can consult Ristad (1989) linked above, or the 1987
book for details.
And
that’s the model we’re after here. In just the same way, Kwisthout and van Rooij
(2013) propose to apply techniques from the mathematical theory of parameterized complexity (Downey
& Fellows, 1999) to the problem of (approximate) Bayesian inference. What this comes down to is this: explicitly
include structural properties of the input – usually obvious dependencies such
as the number of parents for any given node, or the maximum path between nodes
– and see if one can figure out that it’s only these parts of the problem that
contribute to the complexity blow-up, and, further, that we can keep these
troublesome parameters within bounded ranges – perhaps binary branching, or, as
with recoverability of deletion, no nested empty nodes. Kwisthout & van
Rooij, who just ran a tutorial on this very topic at this year’s Cognitive
Science Society meeting, also have a published paper showing precisely how this
can be done for certain Bayesian inference problems. The work here is just
beginning. (See: “Bridging the gap between theory and practice of approximate
Bayesian inference,” 2013. Cognitive
Systems Research, 24, 2–8, or their CogSci tutorial slides here.)
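For readers who want the one-line version (this is the standard Downey & Fellows notion, stated informally by me, not anything specific to the Kwisthout & van Rooij paper): a problem is fixed-parameter tractable with respect to a parameter k – say, the maximum number of parents per node – if it can be solved in time f(k) · n^O(1), where n is the input size and f is some possibly horrendous function of k alone. Keep k small and bounded and the exponential blow-up is quarantined inside f(k); everything else scales polynomially with n. That is the formal cash value of 'constrain the troublesome parameters.'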
Now, just in case you thought that this
Berwick guy has it in for Bayesian inference, you should know that my years
slaving away in the biology electrophoretic gel mines have made me very
pragmatic, so I’ll grab any tool that works.
Norbert noted in his post that Bayesian approaches “seem to have a good way
of dealing with what Mark [Johnson] calls chicken and egg problems.” I agree.
In fact I’ve used it myself for just this sort of thing; e.g., in work
that Sourabh Niyogi (not Partha) and I did, reported in a 2002 Cognitive
Science Society paper, “Bayesian learning at the syntactic-semantic interface,”
here, we showed how easily Bayesian methods can deal with the
(to us pseudo) problem of ‘syntactic’ vs. ‘semantic’ bootstrapping by simply
dumping both sorts of evidence into the Bayesian hopper without any such labels
– because, after all, the kid just gets data, it doesn’t come pre-packaged one
way or the other.
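A minimal sketch of that 'one hopper' point (mine, not the actual Niyogi & Berwick 2002 model; the hypotheses and numbers below are invented purely for illustration): the learner never labels an observation as syntactic or semantic, each one just multiplies into the same posterior.

import math

# Invented hypotheses about a novel verb, with flat priors.
priors = {"transitive": 0.5, "intransitive": 0.5}

# Invented likelihoods p(cue | hypothesis); one cue is 'syntactic', one 'semantic',
# but nothing in the update below cares about that distinction.
likelihood = {
    ("transitive", "NP V NP frame"): 0.7,
    ("transitive", "causative scene"): 0.6,
    ("intransitive", "NP V NP frame"): 0.1,
    ("intransitive", "causative scene"): 0.2,
}

observations = ["NP V NP frame", "causative scene"]

log_post = {h: math.log(p) + sum(math.log(likelihood[(h, o)]) for o in observations)
            for h, p in priors.items()}
z = math.log(sum(math.exp(v) for v in log_post.values()))
for h in log_post:
    print(h, round(math.exp(log_post[h] - z), 3))

Both cues push toward 'transitive' and the posterior reflects their joint force; the learner never needed to be told which kind of evidence was which.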
That said, I should confess that I hold
the same view on Bayes as CMU statistician Cosma Shalizi: “there are people out
there who see Bayes’ Rule as the key to all methodologies, something essential
to rationality. Personally, I find this view thoroughly misguided and not even
a regulative ideal…” Now partly this is due to my intellectual heritage. I was
trained by Art Dempster and John Tukey, both catholics. I can even recall the
exact day in October 1973 when Art strolled into Statistics 126 and started
scribbling the first version of the EM method on the chalkboard, then finishing
it off with a bit on Bayesian hazards. But partly it’s also because I’ve always
found ‘universal solutions’ to be just that – usually not. As I’m sure my philosopher friends like
Norbert would agree, if there really were
a sure-fire solution to the problem of induction, by now everyone would
have heard about it, and such big news would already have made it to the
science pages of the NY Times or the Times’ The
Stone (well, at least to the New York Review of Books). My MIT colleague,
complexity theorist Scott Aaronson, also agrees. In his own blog ‘in response
to a question from the floor’ he observes that there are lots of other, non-Bayesian, ways to
tackle the learning problem, one being probably approximately correct (PAC-learning):
“When you know only one formalism to describe some phenomenon (in this case,
that of choosing hypotheses to fit data), it’s easy to talk yourself into
believing that formalism is the Truth: to paraphrase Caliph Omar, ‘if it
agrees with Bayesianism, it is superfluous; if it disagrees, it is
heresy.’ The antidote is to learn other formalisms.”
So
what about those other formalisms? Where does Bayesian inference fit in? That’s
forthcoming in Bayes Daze II. There, among other things, we will have you play
the Google Translation Game, and do a bit of reverse engineering to learn how
GT works. For warm-up, you might try running the following French sentence through
Google Translate to see what it spits out as English on the other end – and think
a bit about why it’s behaving this way, which will tell you a lot about Bayesian
inference: Le pomme mange le garçon (‘the
apple eats the boy’). Bayes Daze II will also tell you when the Bayesian
approach for choosing grammars was first proposed. This answer too may surprise
you. OK, back to those drinks…
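(A hedged hint, without spoiling Bayes Daze II: statistical translation systems of that era were standardly described with the noisy-channel decomposition ê = argmax_e p(e|f) = argmax_e p(e) · p(f|e), where p(e) is a language model over English output and p(f|e) a translation model. Whether that is exactly what Google Translate does under the hood is part of the reverse-engineering exercise, but notice that a strong enough prior p(e) over sentences English speakers actually produce can overrule what the French input literally says.)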
[1] Jerry Feldman was also
working with Horning at Stanford, and his 1969/1970 Stanford AI lab report,
Stanford AIM-93, “Some decidability results on grammatical inference,” contains
another attempt to formulate a probabilistic inference procedure for
probabilistic grammars. I don’t discuss Feldman’s work here because a number of
years ago, Partha Niyogi and I went through Feldman’s proof and found that it
contained errors; we never took time to go back and see whether it could be
fixed up. See also: Feldman, Horning,
Gips, Reder, Grammatical complexity and
inference. Stanford AIM-89, 1969.
[2] More precisely, Horning
used a ‘meta-grammar’ – a grammar that generated the various PCFGs, roughly as
in Perfors, Tenenbaum & Regier, 2011 in Cognition
and along the lines of Fodor & Sakas’ work on triggers, in Language Acquisition, 2012.
On the computational complexity front:
Chihiro Shibata and Ryo Yoshinaka. PAC Learning of Some Subclasses of Context-Free Grammars with Basic Distributional Properties in ALT 2013 (so not published yet) show some computationally efficient and sample efficient (non-Bayesian) learning of CFGs. Not all CFGs of course, but classes that include all regular languages.
And of course the Smith and Cohen paper we discussed last time establishes learnability of PCFGs from small samples if we neglect computational complexity. Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning, Shay B. Cohen and Noah A. Smith, Computational Linguistics (2012)
And if you don't care about sample complexity or computational complexity then Angluin's technical report from 1988 "Identifying languages from stochastic examples" shows that a huge class can be learned.
These are all worthy examples from the Formal Learning Theory tradition but...
Bayes Daze was meant to focus on, well, Bayesian approaches to learning. We'll
take up Formal Language Theory in later posts. But, as an appe-teaser, it may be of some interest to note that recent work cited in the comment or recent
reviews of the same is seemingly unaware that in many cases it is rehearsing the findings of Formal Learning Theory from long ago. The various notions of 'prudence' (introduced by Scott Weinstein and Dan Osherson in 1982), exact learning, information-efficiency, probabilistic data, and computational complexity were all thoroughly examined in the Formal Learning Theory tradition, as described in the Osherson, Stob, & Weinstein book, "Systems That Learn" (1987) or its 2nd edition. One difference is that this older tradition used recursive function theory applied to learning, derived from Manuel Blum's (1975) seminal paper, rather than modern day complexity classes. But qualitatively it's the same. As Dan Osherson (p.c.) tells me, there don't seem to be any spanking-new, crisp results here. Now, if I were a betting man, when it comes to learning theory, I'd wager that Dan and Scott know what they are talking about. Interestingly enough, Scott and Dan have also found that there *is* a connection between Formal Learning Theory and statistical inference, though it may not be quite what you would have desired. So much for the teaser. And in truth though, if any of this really *was* a solid solution to the problem of induction, do you think we'd all be sitting around typing into funny blogger boxes? I'd be on Wall Street, cleaning up, or, better yet, I'd have already been on Wall Street and cleaned up, and now retired to a sunny island sipping those drinks….
Boy you guys are a tough crowd -- I give you a paper so new it hasn't been published yet, and you reject it without even reading it because a friend of yours says it was all done before in the 80s!
I am a proud owner of the Systems that learn book, and a paper copy of Horning's thesis for that matter, and they are both good reads, but the STL book does have some gaps -- and probabilistic learning and complexity theory are the two main ones. So if you think that computational complexity is a big concern, and you think that probabilistic reasoning is important, then the STL book is *not* the right place to start -- IIRC they even removed all of the probabilistic stuff (all 3 pages of it in the first edition) from the second edition.
(Lots of Machine learning people do leave academia and go to Wall street (e.g. all of the IBM speech recognition guys to Jim Simon's hedge fund .. ), and I guess by now they all have yachts and/or private islands but unfortunately CFGs and MCFGS aren't widely used in finance so my chances of a nice consultancy gig are low ....)
Hey Alex, your parenthetical facts there aren't quite right. Two of the major people working on IBM's machine translation system (not speech recognition), Peter F. Brown and Bob Mercer, left IBM to join Renaissance. If anyone else from that (large) team now works in finance, I don't know it; the one speech recognition person involved was Fred Jelinek, who is (tellingly enough) not co-author on their 1993 paper (the one with details).
And, if Bayes is so old-hat and done business, why did people such as Crain and Thornton in 2002 (and maybe even more recently) complain about Wexler and Culicover's Uniqueness Principle without noting that it has a softer and more plausible statistical version in which you try to improve your grammar by increasing the extent to which it can predict what people will say under various circumstances, including their intended meaning, which can often be guessed from context and partial knowledge of the language?
ReplyDeletewithout noting that <- without that
Like Avery I think the claim that probabilistic learning was widely accepted in models of language acquisition in generative grammar is a little questionable; Charles Yang was complaining in the last thread about getting stick for probabilistic learning (in 2002!)
I'd be particularly interested to hear your take on the Subset Principle (Wexler, Manzini, Berwick...) because to me, it only gets its force from a non-probabilistic learning paradigm.
(BTW, if you only want Bayesian models of learning -- then maybe Chater and Vitanyi's paper in Journal of Mathematical Psychology is essentially a very big extension of Horning's work. (but not computationally bounded).)