Thursday, August 15, 2013

Bayes Daze I.5

I had planned to post Bayes Daze II, but there were such interesting questions that it seemed best to do a version I.5 first. As with many of my blog posts, Bayes Daze I's primary aim was historical continuity: it begins by noting that the *consensus* in serious formal work on language learnability was from the very start probabilistic and, at least for my money, has remained so, from Horning and Wexler to the present day. So the take-away wasn't that Bayes is old hat and done business. Rather, it was that the Bayesian framework of using posterior probability to search for grammars (also known as 'model selection' in the Bayesian biz) has not really changed all that much. It also pointed to computational intractability as a pitfall noted from the very start, one that hasn't been resolved – it is now known to be provably intractable, even if one is after merely approximate solutions. This problem rears its ugly head in all the *specific* Bayesian language models I have had the time to examine closely (verb alternations, 'one' anaphora, word-object linking, P&P-style parameter setting, Johnson's admirable new minimalist-grammar-inspired parameter-feature learning, and some others). Here Bayes Daze I even offered a particular solution path that has been shown to work in some cases: parameterized complexity theory.

To be sure, there *have* been many important advances since Horning, some technical, and some linguistic. I mentioned one (handling all CFGs, not just unambiguous ones – and I’d still really like someone to point me to any citation in the literature that explicitly mentions this important limitation on Horning's original result). Part II was going to explore others. So a (very) short list of these here for now:

1. Better grammatical descriptions.  

(i) Use something other than vanilla PCFGs, such as head-based formalisms or dependency grammars. For example, to explain the poor performance of inference using PCFGs, at the 1995 ACL meeting at MIT, Carl de Marcken took a hard look at the 'shape' of the search space being explored. He found that the 'locality' inherent in CFG rules also made estimation very easily trapped in local maxima. In fact, this is a kind of probabilistic analog of the deterministic 'local maxima' problem that Gibson and Wexler discovered in their 'Triggers' paper. As soon as one scales up to many rules or parameters, the search space can have all sorts of bumps and dimples, and convergence can be a very tricky matter. The technology we have can't solve this kind of nonlinear optimization problem directly, but must make use of techniques like clever sampling methods, and even then success isn't guaranteed. (Cf. the Johnson learner below.) In particular, Carl noted that in examples such as "walking on ice" vs. "walking in ice", CFG rules actually block the 'percolation' of mutual information between a verb and a following prepositional phrase (or, similarly, the bigram mutual information between Subject nouns and verbs). We'd like the information about 'on' vs. 'in' to somehow be available to a rule that sees only the nonterminal label PP. To try to solve this, he formalized the notion of 'head' for PCFGs, so that the information from, e.g., the head noun could be percolated up to the whole phrase. He examined the resulting improvement in learning and suggested that a dependency grammar based entirely on head-modifier relations might work even better. You can read about it here. The rest of the story – importing this idea into statistically based parsing – is, as they say, history.
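
To make the 'percolation' point concrete, here is a toy sketch – my own illustration with made-up probabilities, not de Marcken's actual model – of why a vanilla PCFG rule cannot see the 'on' vs. 'in' contrast while a head-lexicalized rule can:

```python
# Toy illustration (hypothetical probabilities): a vanilla PCFG scores the rule
# VP -> V PP identically for "walking on ice" and "walking in ice", because all
# it ever sees is the nonterminal label PP. Once the head word percolates up,
# the attachment can be scored differently for different prepositions.

# Vanilla PCFG: rule probabilities keyed only by nonterminal names.
vanilla_rules = {
    ("VP", ("V", "PP")): 0.30,
}

# Head-lexicalized PCFG: the preposition heads the PP and the verb heads the VP,
# so the rule probability can depend on the (verb, preposition) pair.
lexicalized_rules = {
    ("VP[walk]", ("V[walk]", "PP[on]")): 0.25,
    ("VP[walk]", ("V[walk]", "PP[in]")): 0.02,
}

def score(rules, lhs, rhs):
    """Probability of expanding lhs into rhs (0.0 if the rule is unseen)."""
    return rules.get((lhs, tuple(rhs)), 0.0)

print(score(vanilla_rules, "VP", ["V", "PP"]))                      # 0.3 either way
print(score(lexicalized_rules, "VP[walk]", ["V[walk]", "PP[on]"]))  # 0.25
print(score(lexicalized_rules, "VP[walk]", ["V[walk]", "PP[in]"]))  # 0.02
```

The vanilla rule assigns the same score to both strings because the label PP hides the preposition; once heads percolate, the lexical dependency the learner needs is finally visible to the rule.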

(ii) Parameterize using more linguistically savvy representations. Example: Mark Johnson's recent model, which he presented at the 19th International Congress of Linguists this July, accounts for French/German/English differences in terms of Pollock's 'classical' analysis (1989), in which the German "verb-second" effect is broken down into 3 separate feature parameters: verb-to-tense movement; tense-to-C movement; and XP-to-SpecCP movement. Instead of exploring the space of all CFGs, the program searches much more constrained territory. Nonetheless, this is still challenging: if you add just a few more parameters, such as null Subjects, then the nonlinear optimization can fail, and so far it doesn't work at all with the 13 parameters that Fodor and Sakas have developed. (If you're familiar with the history of transformational generative grammar you might realize that CFGs have had an extremely short shelf life: they aren't in the original formulation (circa 1955), where generalized transformations do the work of context-free rules; they appear in 1965's Aspects of the Theory of Syntax, but, in a very Marx-like way, they already contain the seeds of their own destruction, being redundant with lexical entries. By 1970's 'Remarks on Nominalization', they're already dust, done in by X-bar theory.)
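
Back to the parameter search itself: for the flavor of it, here is a minimal sketch of Bayesian selection over a small space of binary parameter settings. It is entirely hypothetical – the three parameter names follow the discussion above, but the toy 'corpus' and the compatibility scoring are my own stand-ins, not Johnson's actual likelihood:

```python
import math
from itertools import product

# Toy Bayesian selection over 2^3 binary parameter settings. Each hypothetical
# corpus item is represented by the set of settings that could have generated
# it (a stand-in for actually parsing the item under each candidate grammar).
PARAMS = ("V_to_T", "T_to_C", "XP_to_SpecCP")

corpus = [
    {("1", "1", "1")},                    # a string only the full V2 setting derives
    {("1", "1", "1"), ("1", "0", "0")},   # a string ambiguous between two analyses
]

def log_posterior(setting, data, eps=1e-6):
    """Uniform prior over settings, so the ranking reduces to the likelihood:
    each item contributes log 1 if the setting derives it, log eps otherwise."""
    return sum(math.log(1.0 if setting in item else eps) for item in data)

settings = list(product("01", repeat=len(PARAMS)))
best = max(settings, key=lambda s: log_posterior(s, corpus))
print(dict(zip(PARAMS, best)))   # {'V_to_T': '1', 'T_to_C': '1', 'XP_to_SpecCP': '1'}
```

Even in this caricature the scaling worry is visible: the space doubles with every added binary parameter, and with a realistic likelihood the surface being searched is anything but smooth.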


2. Go nonparametric. What's that? Well, roughly, it means that instead of specifying the number and distribution of model parameters in advance, we let the data do the talking and assume that the number of parameters can be variable and possibly infinite. Now, this demands more heavy-duty statistical ammunition: it's why people invoke all those wonky things you read about, like Dirichlet processes, along with the 'Chinese restaurant/Indian buffet/your-own-ethnic-dinner' processes and such stuff. This resolves at least one worry with Bayesian (parametric) approaches that I did not raise (but which Andy Gelman and Cosma Shalizi do raise in this paper): what happens to Bayesian inference if it turns out that your hypothesis space doesn't include the 'truth'? The answer is that your Bayesian inference engine will still happily converge, and you'll be none the wiser. So, if you want to really hedge your bets, this may be a good way to go. From this angle – the one I embrace – Bayesian inference is simply one smoothing method in the toolkit for dealing with the bias-variance dilemma, what's called a 'regularization device.' (More on this in Part II.) Similarly, if you don't know the shape of the prior distribution, you can try to figure it out from the data itself via so-called 'empirical Bayes'; and here too there have been terrific algorithmic advances to speed this up and make it more doable. Now, you might well ask how this shifts the balance between what is 'given' a priori and what is learned from experience, and in my view that's indeed the right question to ask. Here's where, at least for me, one has to be acquainted with the data: for instance, if 'experience' gives you no examples of sentences with missing Subjects (as is roughly true in English), and yet English children doggedly drop Subjects, then this is hard to explain via experience-driven behavior.
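
For readers who haven't met these gadgets, here is a minimal Chinese restaurant process sketch – purely illustrative, with a concentration parameter alpha and a random seed of my own choosing – showing the basic nonparametric move: the number of 'tables' (clusters, or grammar components) is not fixed in advance but grows as the data arrive:

```python
import random

# Minimal Chinese restaurant process: each new customer joins an existing table
# with probability proportional to its occupancy, or opens a new table with
# probability proportional to the concentration parameter alpha.
def crp(n_customers, alpha=1.0, seed=0):
    rng = random.Random(seed)
    tables = []                        # tables[k] = number of customers at table k
    for n in range(n_customers):
        r = rng.uniform(0, n + alpha)  # total weight so far is n + alpha
        cum = 0.0
        for k, w in enumerate(tables + [alpha]):
            cum += w
            if r <= cum:
                break
        if k == len(tables):
            tables.append(1)           # a new table: the model just grew
        else:
            tables[k] += 1
    return tables

print(crp(50))   # a handful of tables -- their number was never fixed in advance
```

Seat 50 customers and you typically get a handful of tables; seat 5,000 and more appear. The effective number of parameters is set by the data rather than stipulated beforehand, which is exactly the hedge described above.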

As to why people have not embraced probabilistic formulations whole-heartedly, it's surely partly the zeitgeist of the field, but there are other reasons as well – Norbert covered the waterfront pretty well back in his Nov. 5 post. I think it all boils down to explanatory power. If some method pushes the explanatory envelope, I am still optimistic enough to believe that people will buy it. But as long as an approach does not explain why we see *this* particular set of human grammars/languages as opposed to some others, no deal. So, for instance, if we drop the idealization of perfect language acquisition, we can begin to build better explanatory models of language change – inherently statistical ones at that. And this has come to be generally accepted, as far as I can make out. So the challenge on the language acquisition/learning front is to point to a poster-child case where we get better explanations by invoking probabilistic devices – not, as Norbert notes, for details of individual language behavior or different developmental trajectories, but for something deeper than that. I am sure that such cases exist, but it has been hard to find them, just as it has been hard to reduce grammatical constraints to parsing considerations. The US Food and Drug Administration still does not buy probabilistic distributions as a proper rationale for new drug efficacy. Go figure.

4 comments:

  1. Great post Bob; you have put your finger on why I am not a Bayesian.
    Because it doesn't help -- if everything is learnable then you don't get any insight into the sorts of languages that are learnable, and as Partha N used to say, insight is what you want from a mathematical model.
    And they are computationally intractable as Iris van Rooij has been pointing out.

    But the intractability of Bayesian inference doesn't apply to non-Bayesian probabilistic learning approaches, which may be computationally efficient; and you seem to be flipping backwards and forwards about whether you are interested only in Bayesian approaches or in probabilistic learning generally. E.g. de Marcken's paper is about non-Bayesian learning (EM, if I recall). So it is important to remember that the flaws of Bayesian learning are not the flaws of probabilistic learning in general.

  2. Alex, thanks for reminding me about this. You are spot-on: Carl de Marcken had used EM in his tests. The pitfall there wasn't computational intractability; it was the difficulty of finding a global optimum in a high-dimensional, nonlinear space.

  3. The view of Bayes might look different for people who spend a lot of time amongst functionalists, diversity linguists, etc., because it provides a clear reason for formalizing grammars, as well as an upgrade to the evaluation metric that fits in nicely with the concerns of variational sociolinguistics. I think the view of UG that goes naturally with Bayes is more acceptable to such people too.

  4. And a question: what about the Kwiatkowski et al. CCG learner, which is labelled as Bayesian and does appear to learn something?
