Wednesday, April 13, 2016

Yang (himself) on Bayes

Norbert has brought out the main themes of my paper much more clearly than I could have (many thanks for that). This entry is something of a postscript triggered by the comments over the past few days.

The comments remind me of the early days in the Past Tense debate. What does it mean to be a connectionist model? Can't it pass the Wug test if we just get rid of those awful Wickelfeatures? If not backprop, maybe a recurrent net? Most commentators tread a similar terrain:  What’s the distinction between a normative Bayesian model and a cognitive one? How essential is the claim of optimality? Is a model that uses non-Bayesian approximations Bayesian in name only? If not MAP, then how about a full posterior interpretation … [1]

These questions can never be fully resolved because they are questions about frameworks. As Norbert notes, frameworks can only be evaluated by the questions they raise and the answers they provide, not by whether they can or cannot do X, because one can always patch things up. (Of course this holds for the Minimalist Framework as well.) A virtue of the Past Tense debate was that it grounded a largely conceptual/philosophical discussion in a well-defined empirical domain, and we have it to thank for a refined understanding of morphology and language acquisition. That represents progress, even if no minds were changed. So let’s focus on some concrete empirical cases, be it probability matching by rodents or Polish genitives by kids. Framework-level questions go nowhere, especially when the highest priests of Bayesianism disagree.

As I said in the paper, none of my criticisms is necessarily decisive, but taken together I hope they make it worthwhile to pursue alternatives [2]: alternatives that linguists have always been good at (e.g., restricting the hypothesis space), alternatives that take the psychological findings of language acquisition seriously, and alternatives that do not take forever to run. It’s disappointing to see that all the hard lessons have been forgotten. For instance, indirect negative evidence, which was always viewed with suspicion, is now freely invoked without actually working through its complications. The problem doesn't go away when the modeler peeks at the target grammar and rigs the machinery accordingly, even though the modeler is some kind of idealized observer.

Somewhere during the Second Act of the Past Tense debate, connectionist models that implicitly implemented the regular/irregular distinction started to appear. I remember it annoyed the heck out of a young Gary Marcus, but I suspect that an older and wiser Gary would take that as a compliment. 

[1]  A "true" Bayesian model does not necessarily do better. As I noted in the paper, one such model for morphological learning took a week to train on supervised data but only offers very marginal improvement over an online incremental and psychologically motivated unsupervised model, which processed almost a million words in under half an hour.

[2] The paper does offer an alternative, one embedded in a framework that insists on a transparent mapping between the Marrian levels. As in the Past Tense debate, a critique is never enough, and one needs a positive counterproposal. So let's hear some counter-counter-proposals.


  1. I agree that discussing particular proposals and natural language phenomena would be more productive than the debate we've been having over the last few posts. The corollary is that we should make sure that we're not dismissing whole frameworks because of specific modeling choices that are not essential to those frameworks (and, in the case of MAP, are even at odds with the spirit of the framework). It wouldn't be productive if linguists stopped listening to you the moment you said "Bayesian model" because of misconceptions about the ideological commitments of the word "Bayesian" (ditto for "neural network").

    1. I hate to disrupt this kumbaya moment, but I disagree. The whole point of building SPECIFIC models is to use them to explore the general principles that motivate them. Now, it is possible that there are NO general principles (that's the way it looks to me wrt Bayes), in which case the only things to look at are the specific proposals and there is nothing GENERAL to be learned. And that's too bad. That means that there is no theory guiding the modeling. This does not mean that the specific proposals are not interesting. No doubt they are. And, of course, the details matter. But the takeaway message for me is that "Bayes" means nothing, so don't take it seriously. Good to know. Wish you had told us earlier.

  2. Just discovered this series of posts, which has been very interesting. Charles, your paper reminded me of some ideas that Scott Aaronson has been pushing regarding the perils of ignoring complexity when thinking about computation (of particular relevance is the section on induction). Although Scott's primary tool is asymptotic complexity analysis (which can easily be criticized on the timescales learning occurs on), he makes a strong case that even for normative models of real-world phenomena, complexity matters.

    1. Thanks Chris. Highly relevant indeed. My friends in the quantum computing business tell me that there are intractable problems even on a quantum computer--and that's about as real as it'll get!

    2. I really like Iris van Rooij's paper "The Tractable Cognition Thesis", which makes the arguments very well.