I had planned to post Bayes Daze II, but there were such interesting questions that it seemed best to do a version I.5 first. As with many of my blog posts, Bayes Daze I's primary aim was historical continuity: it begins by noting
that the *consensus* about serious formal work into language learnability was
from the very start probabilistic, and at least for my money, has remained so,
from Horning and Wexler to the present day. So the take-away wasn’t that Bayes
is old-hat and done business. Rather, it
was that the Bayesian framework of using posterior probability to search for
grammars (also known as ‘model selection’ in the Bayesian biz) has not really
changed all that much. It also pointed
to computational intractability as a pitfall noted from the very start, and
that this problem hasn’t been resolved – it is now known to be provably
intractable, even if one is after merely
approximate solutions. This problem rears its ugly head in all the *specific*
Bayesian language models I have had the time to examine closely (verb
alternations, 'one' anaphora, word-object linking, P&P style parameter
setting, Johnson's admirable new minimalist-grammar inspired parameter-feature
learning, and some others). Here Bayes
Daze I even offered a particular solution path that has been shown to work in
some cases: parameterized complexity theory.
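In symbols, the posterior-based search described above can be sketched as follows. The notation is introduced here only for illustration and is not taken from Bayes Daze I: $G$ is a candidate grammar, $D$ the observed sentences, and $\mathcal{G}$ the hypothesis space, assumed countable so that the sum makes sense.

```latex
P(G \mid D) \;=\; \frac{P(D \mid G)\,P(G)}{\sum_{G' \in \mathcal{G}} P(D \mid G')\,P(G')},
\qquad
\hat{G} \;=\; \arg\max_{G \in \mathcal{G}} \; P(D \mid G)\,P(G)
```

The denominator does not depend on $G$, so 'model selection' in this sense only needs the product of likelihood and prior. The hard part is the search over $\mathcal{G}$, not the formula.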

To be sure, there *have* been many important advances since Horning, some
technical, and some linguistic. I mentioned one (handling all CFGs, not just
unambiguous ones – and I’d still really like someone to point me to any
citation in the literature that explicitly mentions this important limitation
on Horning's original result). Part II was going to explore others. So a (very)
short list of these here for now:

1. Better grammatical descriptions.

(i) Use something other than vanilla PCFGs, such as head-based formalisms or
dependency grammars. For example, to explain the poor performance of inference
using PCFGs, at the 1995 ACL meeting at MIT, Carl de Marcken took a hard look
at the ‘shape’ of the search space being explored. He found that the ‘locality’ inherent in CFG
rules also made them susceptible to getting very easily trapped in local maxima. In fact, this is a kind of probabilistic analog
of the deterministic ‘local maxima’ problem that Gibson and Wexler discovered in
their ‘Triggers’ paper. As soon as one scales up to many rules or parameters,
the search space can have all sorts of bumps and dimples, and convergence can
be a very tricky matter. The technology
we have can’t solve this kind of nonlinear optimization problem directly, but
must make use of techniques like clever sampling methods. Even then, convergence to the global optimum is not guaranteed. (Cf. the Johnson
learner below.) In particular, Carl noted
that in examples such as “walking on ice” vs. “walking in ice”, CFG rules
actually block the ‘percolation’ of mutual information between a verb and a
following prepositional phrase (or, similarly, the bigram mutual information
between Subject nouns and verbs). We’d like to have the information about ‘on’
vs. ‘in’ to somehow be available to the rule that looks at just the nonterminal
name PP. To try to solve this, he formalized the notion of ‘head’ for PCFGs, so
that the information from, e.g., the head noun could be percolated up to the
whole phrase. He examined the resulting improved learning, suggesting that a
dependency grammar based entirely on head-modifier relations might work even
better. You can read about it here. The rest of the story, importing this idea
into statistically-based parsing, is, as they say, history.

(ii) Parameterize using more linguistically savvy representations. Example: Mark Johnson's recent model that he
presented at the 19th International Congress of Linguists this July accounts for
French/German/English differences in terms of Pollock's 'classical' analysis
(1989), in which the German “verb-second” effect is broken down into 3 separate
feature parameters: verb-to-tense movement; tense-to-C movement; and
XP-to-SpecCP movement. Instead of
exploring the space of all CFGs, the program searches much more constrained
territory. Nonetheless, this is still challenging: if you add just a few more
parameters, such as null Subjects, then the nonlinear optimization can fail,
and so far it doesn’t quite work at all with the 13 parameters that Fodor and
Sakas have developed. (If you're
familiar with the history of transformational generative grammar you might
realize that CFGs have had an extremely short shelf-life: they aren't in the
original formulation (circa 1955), where generalized transformations do the
work of context-free rules; they appear in 1965, in *Aspects of the Theory of Syntax*,
but in a very Marx-like way: there they contain the seeds of their own
destruction, being redundant with lexical entries. By 1970's *Remarks on Nominalization*, they're
already dust, done in by X-bar theory.)
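The local-maxima worry raised in (i), and again for the parameter-setting learner in (ii), can be illustrated with a toy example. This is a minimal sketch only: it uses plain greedy hill-climbing on a made-up one-dimensional objective, not EM and not any of the learners discussed here. The function `score` is a hypothetical stand-in for a bumpy likelihood surface, and the step size and starting points are arbitrary choices.

```python
import math


def score(x):
    # Hypothetical bumpy objective standing in for a likelihood surface.
    # It has several local maxima of different heights.
    return math.sin(3 * x) + 0.5 * x - 0.1 * x ** 2


def hill_climb(x, step=0.01, max_iters=10_000):
    # Greedy local search: move to the better neighbor until neither
    # neighbor strictly improves the score.
    for _ in range(max_iters):
        best = max((x - step, x + step), key=score)
        if score(best) <= score(x):
            return x
        x = best
    return x


# Different starting points can end on different peaks.
for start in (-1.0, 0.0, 2.0):
    end = hill_climb(start)
    print(f"start={start:+.1f}  end={end:+.2f}  score={score(end):.3f}")
```

Each run stops at whichever peak is uphill from its starting point, and nothing in the local rule tells the learner whether a higher peak exists elsewhere. Scaling from one dimension to many rules or parameters only multiplies the bumps and dimples.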

2. Go nonparametric. What's that? Well, roughly, it means that instead of specifying
the number and distribution of model parameters in advance, we let the data do
the talking, and assume that the number of parameters can be variable and
possibly infinite. Now, this demands more
heavy-duty statistical ammunition: it's why people invoke all those wonky
things you read about, like infinite Dirichlet distributions, along with the ‘Chinese
restaurant/Indian buffet/Your-own-ethnic/dinner processes’ and such stuff. This resolves at least one worry with
Bayesian (parametric) approaches that I did not raise (but which Andy Gelman
and Cosma Shalizi do raise in this
paper): what happens to Bayesian
inference if it turns out that your hypothesis space doesn’t include the
‘truth’? The answer is that your
Bayesian inference engine will still happily converge, and you’ll be none the
wiser. So, if you want to really hedge
your bets, this may be a good way to go. From this angle, the one I embrace,
Bayesian inference is simply one smoothing method in the toolkit, all to deal
with the bias-variance dilemma, what’s called a ‘regularization device.’ (More
on this in part II.) Similarly, if you don’t know the shape of the prior
distribution, you can try to figure it out from the data itself via
so-called ‘empirical Bayes’; and here also there have been terrific algorithmic
advances to speed this up and make this more doable. Now, you might well ask
how this shifts the balance between what is ‘given’ *a priori*, and what is learned from experience, and in my view that’s indeed the right question to ask. Here’s where, at least for me, one has to be acquainted with the data: for instance, if ‘experience’ gives you no examples of sentences with missing Subjects (as is roughly true in English), and yet English children doggedly drop Subjects, then this is hard to explain via experience-driven behavior.
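For readers who have not met the 'Chinese restaurant process' mentioned above, here is a minimal Python sketch of its standard seating rule: customer number n+1 joins an existing table with probability proportional to the number of people already sitting there, and opens a new table with probability proportional to a concentration parameter. The function name, the parameter name `alpha`, and the fixed seed are illustrative choices, not drawn from any of the models discussed in this post.

```python
import random


def crp_table_sizes(n_customers, alpha, seed=0):
    """Seat customers one at a time; return the list of table sizes.

    With n customers already seated, the next one joins table k with
    probability size_k / (n + alpha) and opens a new table with
    probability alpha / (n + alpha). Requires alpha > 0.
    """
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[k] += 1  # join an existing table
                break
        else:
            tables.append(1)  # open a new table
    return tables


sizes = crp_table_sizes(n_customers=1000, alpha=2.0)
print(len(sizes), "tables; largest:", sorted(sizes, reverse=True)[:5])
```

The number of tables is not fixed in advance; it grows as more customers arrive, which is the sense in which the data, rather than the modeler, decides how many 'parameters' there are.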

As to the reason why people have not embraced probabilistic formulations whole-heartedly, it must surely be true that it’s partly the *zeitgeist* of the field, but there are other reasons as well; Norbert covered the waterfront pretty well back in his Nov. 5 post. I think it all boils down to explanatory power. If some method pushes the explanatory envelope, I am still optimistic enough to believe that people will buy it. But as long as some approach does not explain why we see *this* particular set of human grammars/languages as opposed to some others, no deal. So for instance, if we drop the idealization of perfect language acquisition, one can begin to build better explanatory models of language change, inherently statistical ones at that. And this has come to be generally accepted, as far as I can make out. So the challenge on the language acquisition/learning front is to point to a poster child case where we get better explanations by invoking probabilistic devices: not, as Norbert notes, for details regarding individual language behavior, or different developmental trajectories, but something deeper than that. I am sure that such cases exist, but it has been hard to find them, just as it has been hard to reduce grammatical constraints to parsing considerations. The US Food and Drug Administration still does not buy probabilistic distributions as a proper rationale for new drug efficacy. Go figure.

Great post, Bob; you have put your finger on why I am not a Bayesian.

Because it doesn't help -- if everything is learnable then you don't get any insight into the sorts of languages that are learnable, and as Partha N used to say, insight is what you want from a mathematical model.

And they are computationally intractable as Iris van Rooij has been pointing out.

But the intractability of Bayesian inference doesn't apply to non-Bayesian probabilistic learning approaches, which may be computationally efficient; and you seem to be flipping backwards and forwards about whether you are interested only in Bayesian approaches or in probabilistic ones generally. E.g. de Marcken's paper is about non-Bayesian learning (EM, if I recall). So it is important to remember that the flaws of Bayesian learning are not the flaws of probabilistic learning in general.

Alex, thanks for reminding me about this. You are spot-on: Carl de Marcken had used EM in his tests. The pitfall here wasn't computational intractability; it was the difficulty of finding a global optimum in a high-dimensional, nonlinear space.

The view of Bayes might look different for people who spend a lot of time amongst functionalists, diversity linguists, etc., because it provides a clear reason for formalizing grammars, as well as an upgrade to the evaluation metric that fits in nicely with the concerns of variational sociolinguistics. I think the view of UG that goes naturally with Bayes is more acceptable to such people too.

And a question: what about the Kwiatkowski et al. CCG learner, which is labelled as Bayesian and does appear to learn something?
