The title of this post is doubly misleading. First, I have no problem with the Reverend. He’s never treated me badly, most likely because he’s been dead for some time. It’s the modern-day application of his rule that disturbs me. Second, my “problem” is more a discomfiture than a full-blown ache. I don’t know enough (though I wish I did, really) to be distressed. However, uncomfortable I am (say this as Yoda would), and I have decided to post about this unease in the hope that a kindly Bayesian will take pity on me and put my worries to rest. So, what follows tries to articulate the queasy feelings I have about the Bayesian attitude towards cognitive problems, especially in the domain of language. I put them here not because I am confident in the remarks that follow, but because I want to see if the way I am thinking about these matters makes any sense. So let’s start.
First, these divagations were prompted by a nice comment by Mark Johnson, a reply by Charles Yang, and a comment by Aaron White here. The comments led me to think about my worry in the following way. Be warned: it takes a little time to get there.
As you know, I am obsessed with the contrast between rationalist (R) and empiricist (E) conceptions of mind. This contrast has often, IMO, been misunderstood as a contrast between nativist and non-nativist conceptions. However, this cannot be correct, for any theory of acquisition, even the most empiricist one, requires natively given structures to operate; i.e. every theory of “learning” needs a given (i.e. native) hypothesis space against which data provided by the environment are evaluated. So a better way of marking the R/E contrast is in how articulated the hypothesis space in any given domain is. Es take as their 0th assumption that the space is pretty wide and fairly flat and that learning (i.e. the sophisticated use of environmental information) guides navigation through the space. Rs take as their 0th assumption that cognitive spaces are pretty narrow and highly structured and that environmental influences, though not unimportant, play a secondary role in explaining why/how acquirers get from where they start to where they end up. If there are lots of roads from A to B, then finding the best one can be a very complicated task. If there are just one or two, choosing correctly is not nearly as complicated. This I take to be the main R/E difference: both concede the importance of native structure and both leave a role for environmental input. The difference lies in the relative weight each assigns to these factors. Es bet that the action lies in good ways of evaluating the environmentally provided information. Rs lay their money on finding the narrow set of articulated options. With this as background, here’s my problem with Reverend Bayes’ descendants.
I believe that Bayesian methods generally favor the first conception of the acquisition problem. I am pretty sure that this is not required, i.e. there is nothing in Bayesianism per se that requires it. However, one reason to use fancy counting methods (and they can be fancy indeed (think Dirichlet)) is the belief that how one counts is largely causally responsible for where one ends up. In other words, Bayesian methods are interesting to the degree that the set of options is wide. If so, the real trick is to figure out how to efficiently navigate this space in response to environmentally provided information. Thus, Bayesian affinities resonate harmoniously with E-like background assumptions. Consequently, if one believes, as I do, that a good deal (most?) of the interesting causal action lies with the constrained and articulated shape of the hypothesis space, then one will look on Bayesian predilections with some suspicion. Put more diplomatically, if there is a trade-off between how tight and structured the space of options is and how complex and sophisticated the learning procedure is, then Rs and Es will place their research bets in different places even if both features (i.e. the shape of the space and the nature of the learning theory) are agreed to be important.
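The trade-off can be made concrete with a toy calculation. In the sketch below (all numbers are invented for illustration), a Bayesian learner starts from a flat prior over a hypothesis space and updates on repeated observations; the only thing that varies between runs is how big the space is. The wider and flatter the space, the more environmental evidence is needed before the right hypothesis dominates, which is exactly where the E-ish bet places the explanatory burden.

```python
def posterior(prior, likelihoods):
    """One Bayesian update: P(h|d) is proportional to P(d|h) * P(h)."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def updates_to_threshold(n_hyps, threshold=0.95):
    """Count identical observations needed until the true hypothesis
    dominates, starting from a flat prior over n_hyps hypotheses.
    Toy assumption: the true hypothesis assigns each observed datum
    probability 0.9; every competitor assigns it 0.5."""
    prior = [1.0 / n_hyps] * n_hyps
    likes = [0.9] + [0.5] * (n_hyps - 1)
    steps = 0
    while prior[0] < threshold:
        prior = posterior(prior, likes)
        steps += 1
    return steps

# A wide, flat space demands more evidence than a narrow one.
print(updates_to_threshold(2))     # small space: few updates suffice
print(updates_to_threshold(1000))  # big flat space: many more needed
```

The point of the sketch is only that, with everything else held fixed, enlarging and flattening the space shifts the causal burden onto the data and the updating procedure.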
If this is so, then the problem with Bayesianism from my Rish point of view is that it presupposes an answer to (and hence begs) the fundamental question of interest: how structured is the mind?
You can get a good taste of this from Perfors et al.’s paper (here). What it shows is that if one starts with three grammatical options, a linear grammar, a simple right-branching grammar, and a phrase structure grammar (PSG), then there is information in the linguistic input that would favor choosing PSGs, and that an (ideal) Bayesian acquisition device could use this information to converge on this grammar. This conclusion gets used to argue that linguistic minds need not specify the choice of hierarchical grammars by LADs (viz. PSGs) as “innate,” for a Bayesian learning mechanism suffices to reach this same conclusion without ruling out the other options nativistically. Putting aside whether anyone actually argued for what Perfors et al. argued against (see here for a very critical discussion), what is useful for present purposes is that Perfors et al. illustrate how Bayesians trade restrictive hypothesis spaces for Bayesian learning. Indeed, in my experience (limited though it is) I’ve noticed that one thing Bayesians never mind doing is throwing another option into the space of possibilities, secure in the knowledge that, in the limit, the Bayesian learner will get to the right one no matter how big the space is.
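The logic of this style of argument can be caricatured in a few lines. In the hypothetical sketch below, each candidate grammar gets a simplicity prior (bigger grammars are penalized) and a likelihood score for the corpus; the grammar names echo the three options above, but the numbers and the scoring scheme are entirely made up, not Perfors et al.’s actual model. The sketch just shows how a Bayesian comparison can select the hierarchical grammar without its being the only option in the space.

```python
import math

# Invented stand-ins: each "grammar" is summarized by a rough size
# (how many analyses it makes available) and a log-likelihood for
# some fixed corpus. None of these numbers come from the paper.
grammars = {
    "flat (linear)":    {"size": 10,   "log_like": -40.0},
    "right-branching":  {"size": 100,  "log_like": -25.0},
    "phrase-structure": {"size": 1000, "log_like": -10.0},
}

def log_posterior(g):
    # Simplicity prior: bigger grammars get a lower prior
    # (log-prior falls with the log of grammar size), but a grammar
    # that fits the corpus much better can still win on likelihood.
    log_prior = -math.log(g["size"])
    return log_prior + g["log_like"]

best = max(grammars, key=lambda name: log_posterior(grammars[name]))
print(best)  # the fit outweighs the size penalty: "phrase-structure"
```

Note how easy it is, on this setup, to "throw another option into the space": just add another entry to the dictionary and let the comparison sort it out.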
To repeat something I said earlier, it is possible that Bayesian methods of counting have advantages even in highly structured hypothesis spaces of the kind that syntacticians like me are predisposed to think exist in the domain of language. However, this is what needs showing, in my view. Moreover, and this lies behind some of my worry, one can view the SYTG paper discussed in a prior post (here) as arguing that in such a context such methods are not at all helpful. Indeed, they point in the wrong direction. Berwick gave similar arguments at the last LSA for morphological acquisition. In both kinds of cases, it seems that in the more restricted hypothesis domains they consider, much simpler counting procedures do a whole lot better, and for principled reasons. So, even though there is nothing incompatible between the Reverend’s rule and structured minds, there is an affinity between Eism and Bayesianism that shows up both in theory (how the computational problem is posed) and in practice (in concrete proposals for dealing with particular problems).
So, why do the Reverend’s acolytes bother me? Reason/feeling number 1: because their conception of acquisition is largely environmentally driven, and my Rish sensibilities, based as they are on what happens in the domain of language, lead me to think that this is wrong. And not just a little wrong, but deeply wrong. Wrong in conception, not wrong in execution. Or, to put this in Marr’s terms, Bayesians in their E-ishness have misconstrued the computational problem to be solved. It’s not how we use environmental input to navigate a big flat space, but how we use such data to make relatively simple structured choices.
The Marr segue above leads to a second (and largely secondary) source of my unease. Bayesians often describe their theories as Level 1 computational theories in Marr’s sense (see here in reply to here). Here’s Mark Johnson, for example, from the comment linked above. I interpret “probabilistic” here as “Bayesian.”
A probabilistic model is what Marr called a "computational model"; it specifies the different kinds of information involved and how they interact.
There is a good reason why Bayesians endorse this view of their proposals: interpreted algorithmically, these theories often seem to be computational disasters (e.g. here and here). Suffice it to say that there seems to be general agreement that Bayesian analyses do not lend themselves to easy, transparent algorithmization. Oddly, as Stabler notes here, in discussing the efficient parsability of MG-definable languages, this is not the case for standard minimalist grammars:
In CKY and Earley algorithms, the operations of the grammar (em and im) [internal and external merge, NH] are realized quite directly [my emphasis, NH] by adding, roughly, only bookkeeping operations to avoid unnecessary steps (p.8).
Indeed, in my limited experience most generative parsing models, from the Marcus parser onward, have had relatively transparent relations between grammars and parsers, and this was considered a virtue of both (see here). Indeed, since grammars do get used, and seem to be used relatively effectively, it would be copacetic if there were a nice simple relation between competence grammars and parsing grammars (between grammars and the algorithms used to parse incoming “language”). However, from what I can gather, this is something of a problem for Bayesians, as it seems that the move from a level 1 competence theory to a level 2 algorithmic account won’t be particularly simple or transparent, since the simple, transparent candidates seem to be a computational mess.
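The kind of transparency Stabler describes can be seen in miniature in a CKY recognizer: the grammar’s rules are consulted directly by the parsing algorithm, with little beyond chart bookkeeping added. The toy CNF grammar and lexicon below are my own invention, not drawn from any of the papers discussed; the point is only that the parser is the grammar plus a chart.

```python
# A toy grammar in Chomsky normal form. The recognizer below applies
# these rules directly; everything else is bookkeeping over a chart.
RULES = {            # A -> B C
    "S":  [("NP", "VP")],
    "VP": [("V", "NP")],
    "NP": [("D", "N")],
}
LEX = {"the": "D", "dog": "N", "bit": "V", "man": "N"}

def cky_recognize(words):
    n = len(words)
    # chart[i][j] = set of nonterminals spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(LEX[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for lhs, rhss in RULES.items():
                    for b, c in rhss:
                        # each grammar rule is realized as-is
                        if b in chart[i][k] and c in chart[k][j]:
                            chart[i][j].add(lhs)
    return "S" in chart[0][n]

print(cky_recognize("the dog bit the man".split()))  # True
print(cky_recognize("dog the bit man the".split()))  # False
```

Nothing in the algorithm duplicates or recodes the grammar; change a rule in RULES and the parser’s behavior changes accordingly, which is roughly what a "transparent" competence-performance relation amounts to.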
My earlier post on word acquisition (here) touches on this, noting that the procedures SYTG ran to test the Bayesian story were a pain to run, and that this is a general feature of Bayesian accounts. However, I am sure that the issue is very complex and that I have probably misunderstood matters (hence my setting down my worries to act, I hope, as useful targets).
Let me end on one thing that I like about Bayesian approaches. It seems that they have a good way of dealing with what Mark calls chicken-and-egg problems. Bayesians have ways of combining two difficult problems so as to make solving each easier. This looks like it would very often be a useful thing to be able to do. And to the degree that Bayesians offer a compelling way of doing this, it is a good thing. A question: is this kind of solution to the chicken-and-egg problem limited to Bayesian kinds of analysis, or is it a property that simpler counting systems could encode as well? If it is a distinctive property of Bayesianism, that would seem to be a very nice feature.
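For concreteness, here is a minimal sketch of the chicken-and-egg logic, using the classic two-coins example rather than anything linguistic (the data are invented). Estimating each coin’s bias would be easy if we knew which coin produced each batch of flips, and assigning batches to coins would be easy if we knew the biases; expectation-maximization (EM) alternates soft versions of the two easy problems and thereby solves both at once.

```python
# Chicken-and-egg toy: batches of flips from two coins of unknown
# bias, with no labels saying which coin produced which batch.
def em_two_coins(batches, theta=(0.6, 0.4), n_iter=50):
    """batches: list of (heads, tails) pairs; theta: initial guesses."""
    for _ in range(n_iter):
        # E-step: soft-assign each batch to a coin given current biases
        counts = [[0.0, 0.0], [0.0, 0.0]]  # per-coin [heads, tails]
        for heads, tails in batches:
            likes = [t ** heads * (1 - t) ** tails for t in theta]
            z = sum(likes)
            for c in (0, 1):
                w = likes[c] / z
                counts[c][0] += w * heads
                counts[c][1] += w * tails
        # M-step: re-estimate each coin's bias from its soft counts
        theta = tuple(h / (h + t) for h, t in counts)
    return theta

# (heads, tails) per batch of 10 flips; three batches look like a
# heads-heavy coin, two like a tails-heavy one.
batches = [(9, 1), (8, 2), (2, 8), (1, 9), (9, 1)]
print(em_two_coins(batches))
```

As to the closing question: note that nothing here is distinctively Bayesian; EM is maximum-likelihood estimation, so at least some of the chicken-and-egg magic is available to simpler counting systems too.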
Let me abruptly end here and let the target practice (and enlightenment) begin.
 This is one of the nicest things about blogs. In contrast to articles, where you need arguments that start from reasonable premises and go in a coherent direction, in a blog post it is possible to ruminate out loud and hope that with a little help from your “friends” a little more clarity might be forthcoming.
 It is a curious consequence of the Bayesian position that, at first blush, it postulates a whole lot more native givens than Rs typically do. I suspect that this is related to what Gallistel and King call the “infinitude of the possible.” The Bayesian way is to load the hypothesis space with LOTS of options and winnow them down using environmental information. This puts a lot into the space. If the space is given (i.e. innate), then this approach has the curious property of loading the mind with a lot of stuff, much more than Rs typically take to be there. So, in a curious sense, Es of this stripe are far more “nativist” than Rs are.