Faculty of Language: My Problems with Reverend Bayes

Monday, August 5, 2013

The title of this post is doubly misleading. First, I have no problem with the Reverend. He’s never treated me badly, most likely because he’s been dead for some time. It’s the modern-day application of his rule that disturbs me. Second, my “problem” is more a discomfiture than a full-blown ache. I don’t know enough (though I wish I did, really) to be distressed. However, uncomfortable I am (say this as Yoda would), and have decided to post about this unease in the hope that a kindly Bayesian will take pity on me and put my worries to rest. So, what follows tries to articulate the queasy feelings I have about the Bayesian attitude towards cognitive problems, especially in the domain of language, and I put them here, not because I am confident in the following remarks, but because I want to see if the way I am thinking about these matters makes any sense.[1] So let’s start.

First, these divagations have been prompted by a nice comment by Mark Johnson and reply by Charles Yang and comment by Aaron White here. The comments lead me to think about my worry in the following way. Be warned, it takes a little time to get there.

As you know, I am obsessed with the contrast between rationalist (R) and empiricist (E) conceptions of mind. This contrast has often, IMO, been misunderstood as a contrast between nativist and non-nativist conceptions. However, this cannot be correct, for any theory of acquisition, even the most empiricist one, requires natively given structures to operate, i.e. every theory of “learning” needs a given (i.e. native) hypothesis space against which data provided by the environment is evaluated. So a better way of marking the R/E contrast is in how articulated the hypothesis space in any given domain is. Es take as their zeroth assumption that the space is pretty wide and fairly flat and that learning (i.e. the sophisticated use of environmental information) guides navigation through the space. Rs take as their zeroth assumption that cognitive spaces are pretty narrow and highly structured and that environmental influences, though not unimportant, play a secondary role in explaining why/how acquirers get from where they start to where they end up. If there are lots of roads from A to B then finding the best one can be a very complicated task. If there are just one or two, choosing correctly is not nearly as complicated. This I take to be the main R/E difference: both concede the importance of native structure and both leave a role for environmental input. The difference lies in the relative weight each assigns to these different factors. Es bet that the action lies with good ways of evaluating the environmentally provided information. Rs lay their money on finding the narrow set of articulated options. With this as background, here’s my problem with Reverend Bayes' descendants.[2]

I believe that Bayesian methods generally favor the first conception of the acquisition problem. I am pretty sure that this is not required, i.e. there is nothing in Bayesianism per se that requires this. However, one reason to use fancy counting methods (and they can be fancy indeed; think Dirichlet) is the belief that how one counts is largely causally responsible for where one ends up. In other words, Bayesian methods are interesting to the degree that the set of options is wide. If so, the real trick is to figure out how to efficiently navigate this space in response to environmentally provided information. Thus, Bayesian affinities resonate harmoniously with E-like background assumptions. Consequently, if one believes, as I do, that a good deal (most?) of the interesting causal action lies with the constrained and articulated shape of the hypothesis space, then one will look on Bayesian predilections with some suspicion. Put more diplomatically, if there is a trade-off between how tight and structured the space of options is and how complex and sophisticated the learning procedure is, then Rs and Es will place their research bets in different places even if both features (i.e. the shape of the space and the nature of the learning theory) are agreed to be important.

If this is so, then the problem with Bayesianism from my Rish point of view is that it presupposes an answer to (and hence begs) the fundamental question of interest: how structured is the mind?

You can get a good taste of this from Perfors et al.’s paper (here). What it shows is that if one starts with three grammatical options, a linear grammar, a simple right-branching grammar and a phrase structure grammar (PSG), then there is information in the linguistic input that would favor choosing PSGs and that an (ideal) Bayesian acquisition device could use this information to converge on this grammar. This conclusion gets used to argue that linguistic minds need not specify that the choice of hierarchical grammars by LADs (viz. PSGs) is “innate”, for a Bayesian learning mechanism suffices to reach this same conclusion without ruling out the other options nativistically. Putting aside whether anyone argued for what Perfors et al. argued against (see here for a very critical discussion), what is useful for present purposes is that Perfors et al. illustrate how Bayesians trade Bayesian learning for restrictive hypothesis spaces. Indeed, in my experience (limited though it is) I’ve noticed that one thing Bayesians never mind doing is throwing another option into the space of possibilities, secure in the knowledge that, in the limit, the Bayesian learner will get to the right one no matter how big the space is.
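
To make the kind of inference at issue concrete, here is a toy sketch of Bayesian updating over three candidate grammar types. The hypothesis names echo the three options above, but the flat prior and the per-sentence likelihoods are invented purely for illustration and are not taken from Perfors et al.; the function name `update` is likewise hypothetical.

```python
# Toy Bayesian update over three candidate grammar types.
# All numbers are invented for illustration; they are NOT from Perfors et al.

def update(prior, likelihood):
    """Return the posterior: prior times likelihood, renormalized."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Flat prior: no grammar type is favored in advance (an assumption).
prior = {"linear": 1 / 3, "right_branching": 1 / 3, "psg": 1 / 3}

# Hypothetical probability each grammar type assigns to one input sentence.
likelihood = {"linear": 0.01, "right_branching": 0.02, "psg": 0.04}

posterior = dict(prior)
for _ in range(10):  # ten sentences, each treated as independent
    posterior = update(posterior, likelihood)

best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 4))
```

With a flat prior the outcome is driven entirely by the likelihoods, so in this sketch all the work is done by the input rather than by any restriction on the space.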

To repeat something I said earlier, it is possible that Bayesian methods of counting have advantages even in highly structured hypothesis spaces of the kind that syntacticians like me are predisposed to think exist in the domain of language. But this is what needs showing, in my view. Moreover, and this lies behind some of my worry, one can view the SYTG paper discussed in a prior post (here) as arguing that in such a context such methods are not at all helpful. Indeed, they point in the wrong direction. Berwick gave similar arguments at the last LSA for morphological acquisition. In both kinds of cases, it seems that in the more restricted hypothesis domains that they consider, much simpler counting procedures do a whole lot better, and for principled reasons. So, even though there is nothing incompatible between the Reverend’s rule and structured minds, there is an affinity between Eism and Bayesianism that shows up both in theory (how the computational problem is posed) and practice (in concrete proposals for dealing with particular problems).

So, why do the Reverend’s acolytes bother me? Reason/feeling number 1: because their conception of acquisition is largely environmentally driven and my Rish sensibilities, based as they are in what happens in the domain of language, lead me to think that this is wrong. And not just a little wrong, but deeply wrong. Wrong in conception, not wrong in execution. Or, to put this in Marr’s terms, Bayesians in their E-nishness have misconstrued the computational problem to be solved. It’s not how do we use environmental input to navigate a big flat space but how do we use such data to make relatively simple structured choices.

The Marr segue above leads to a second (and largely secondary) source of my unease. Bayesians often describe their theories as Level 1 computational theories in Marr’s sense (see here in reply to here). Here’s Mark Johnson, for example, from the comment linked above. I interpret “probabilistic” here as “Bayesian.”

A probabilistic model is what Marr called a "computational model"; it specifies the different kinds of information involved and how they interact.

There is a good reason why Bayesians endorse this view of their proposals: interpreted algorithmically, these theories often seem to be computational disasters (e.g. here and here). Suffice it to say that there seems to be general agreement that Bayesian analyses do not lend themselves to easy transparent algorithmization. Oddly, as Stabler notes here, in discussing the efficient parsability of MG definable languages, this is not the case for standard minimalist grammars:

In CKY and Earley algorithms, the operations of the grammar (em and im) [external and internal merge, NH] are realized quite directly [my emphasis, NH] by adding, roughly, only bookkeeping operations to avoid unnecessary steps (p.8).

Indeed, in my limited experience most generative parsing models starting from the Marcus parser onward have had relatively transparent relations between grammars and parsers, and this was considered a virtue of both (see here). Indeed, since grammars do get used and seem to be used relatively effectively, it would be copacetic if there were a nice simple relation between competence grammars and parsing grammars (between grammars and algorithms that are used to parse incoming “language”). However, from what I can gather, this is something of a problem for Bayesians, as it seems that the move from a level 1 competence theory to a level 2 algorithmic account won’t be particularly simple or transparent, as the simple transparent ones seem to be a computational mess.

My earlier post on word acquisition (here) touches on this, noting that the procedures that SYTG ran to test the Bayesian story were a pain to run and that this is a general feature of Bayesian accounts. However, I am sure that the issue is very complex and that I have probably misunderstood matters (hence my setting down my worries to act, I hope, as useful targets).

Let me end on one thing that I like about Bayesian approaches. It seems that they have a good way of dealing with what Mark calls chicken and egg problems. Bayesians have ways of combining two difficult problems to make the solving of each easier. This looks like it would very often be a useful thing to be able to do. And to the degree that Bayesians offer a compelling way of doing this, this is a good thing. A question: is this kind of solution to the chicken-egg problem limited to Bayesian kinds of analysis or is it a property that simpler counting systems could encode as well? If it is a distinctive property of Bayesianism, that would seem to be a very nice feature.

Let me abruptly end here and let the target practice (and enlightenment) begin.

[1] This is one of the nicest things about blogs. In contrast to articles where you need arguments that start from reasonable premises and go in a coherent direction, in a blog post it is possible to ruminate out loud and hope that with a little help from your “friends” a little more clarity might be forthcoming.

[2] It is a curious consequence of the Bayesian position that, at first blush, they postulate a whole lot more native givens than Rs typically do. I suspect that this is related to what Gallistel and King call the “infinitude of the possible.” The Bayesian way is to load the hypothesis space with LOTS of options and winnow them down using environmental information. This puts a lot into the space. If the space is given (i.e. innate) then this approach has the curious property of loading the mind with a lot of stuff, much more than Rs typically consider there. So, in a curious sense, Es of this stripe are far more “nativist” than Rs are.

21 comments:

ewan, August 5, 2013 at 3:26 PM
Target practice...

(1) "I believe that Bayesian methods generally favor the [empiricist] conception of the acquisition problem. I am pretty sure that this is not required . . . If this is so, then the problem with Bayesianism from my Rish point of view is that it presupposes an answer to (and hence begs) the fundamental question of interest: how structured is the mind?"

The problem is that you presuppose an answer to the fundamental question of interest: how empiricist is the Bayesian? This is a strawman - a common theme in replies to Bayesians - and is, moreover, false. As such, there isn't really much more to be said. See the replies to Jones and Love's BBS article (same strawman), which mainly said what I'm saying here, much more articulately, and many, many times.

(2) "However, one reason to use fancy counting methods . . ."

Probabilistic inference is not fancy counting. Probabilistic inference is not fancy counting. Probabilistic inference is not fancy counting! Probabilistic inference is (one form of) generalized deduction, and a consequence of using it is that it so happens that more evidence is better evidence in most circumstances. As a result, the math will shake out to be sensitive to how much data you have, which I guess is vaguely isomorphic to counting. But this is way too generous; I'm not sure how estimating the center of a cluster of data points in space by finding the mean of the observed data is really "counting" in any reasonable sense, even if it is sensitive to the number of data points.

Norbert, August 5, 2013 at 3:41 PM
"Probabilistic inference is (one form of) generalized deduction, and a consequence of using it is that it so happens that more evidence is better evidence in most circumstances."

I take it that this is precisely what's at issue, no? It seems that in many circumstances, e.g. SYTG's, this is false. So isn't the assumption that it is generally true begging a question? And if this is your view, won't the main problem be getting lots of evidence and figuring out how to weight it? Again, there is no reason to think that Bayesians need have weak and unstructured hypothesis spaces, but many do and concentrate on figuring out how to massage the data to get where you want to go, as e.g. Perfors et al. do. I will, however, take a look at the BBS piece. Thx.

Alex Clark, August 6, 2013 at 7:09 AM
This comment has been removed by the author.

Alex Clark, August 6, 2013 at 7:26 AM
"It is a curious consequence of the Bayesian position that, at first blush, they postulate a whole lot more native givens than Rs typically do. I suspect that this is related to what Gallistel and King call the “infinitude of the possible.” The Bayesian way is to load the hypothesis space with LOTS of options and winnow them down using environmental information. This puts a lot into the space. If the space is given (i.e. innate) then this approach has the curious property of loading the mind with a lot of stuff, much more than Rs typically consider there. So, in a curious sense, Es of this stripe are far more “nativist” than Rs are."

This is an interesting point; suppose you have a prior P2 which only has two hypotheses, H1 and H2, and a prior P4 which has four, H1, H2, H3 and H4. So from one point of view P2 has a "smaller" nativist component, since it has fewer grammars, but from another P4 is weaker because it is higher entropy and less informative. Both of these intuitions are in my view valid, though the latter is more standard in Bayesian terms.

It depends on whether you have an information theoretic view of complexity (Shannon) or an algorithmic view (Kolmogorov). Information theoretically we say that P4 is less informative than P2 but P4 will normally be algorithmically more complex than P2.
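
The information-theoretic half of this contrast can be checked with a small sketch. It assumes, purely for illustration, that both priors are uniform over their hypotheses (the comment does not say so); `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Assumed uniform priors over two and four hypotheses.
p2 = [1 / 2, 1 / 2]
p4 = [1 / 4, 1 / 4, 1 / 4, 1 / 4]

print(entropy_bits(p2))  # 1.0
print(entropy_bits(p4))  # 2.0
```

Under that assumption P4 carries more entropy than P2, so it commits to less in advance, even though listing four hypotheses takes a longer description than listing two.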

Of course if the class of hypotheses is something like all MGs or MCFGs, then this is very low on both scales.

(previous comment had a typo that meant it didn't make sense).

Avery Andrews, August 6, 2013 at 7:11 PM
It makes perfectly good sense, it's just kind of vague and speculative. The E position is also rather speculative in the absence of E-learners or (weaker) nonconstructive specifications of descriptively adequate grammars from realistic types and amounts of data.

The Kwiatkowski et al. CCG learner is an interesting case: it is somewhat R-ish in that CCG imposes some restrictions on the nature of the form-meaning relationship, and it seems to learn an impressive amount of stuff (but how would it fare on Icelandic, Kayardild or Dinka?).

Chris, August 7, 2013 at 3:15 AM
If your concern is that Bayesians might be "too E" and this troubles you, I think you should rest assured that they can be as "R" as you like (although in practice they may not be). Let me try to give a summary of my "big tent" Bayesian view. Bayesian statistics is more than happy to have richly structured hypotheses. As Charles Yang points out in the post that prompted this, his probabilistic parameter setting work has a Bayesian interpretation, and it certainly adheres strongly to traditional notions of syntax. Bayesian reasoning says just two things: that beliefs and data are treated as the same "kind" of things (statistical objects characterized by distributions); and that beliefs are updated in a particular way given evidence: in particular, such that the probability of a particular "posterior" belief is proportional to your prior belief in its correctness times the likelihood of the evidence assuming the particular belief. To the extent that you would want to claim that language learning (or any other kind of learning) is not "Bayesian", it would necessarily have to be a claim about one of these things being false. And, to the extent that observed behavior deviates from these ideals, the standard arguments with respect to "ideal learners" vs. "actual learners" are still worth keeping in mind.
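
In symbols, the update described here is the standard statement of Bayes' rule. The notation is chosen for this illustration and is not the commenter's: h is a hypothesis (belief) drawn from a hypothesis space H, assumed discrete, and d is the observed data.

```latex
P(h \mid d) \;=\; \frac{P(d \mid h)\,P(h)}{\sum_{h' \in H} P(d \mid h')\,P(h')} \;\propto\; P(d \mid h)\,P(h)
```

Here P(h) is the prior, P(d | h) is the likelihood, and the denominator only rescales the products so that the posterior sums to one over H.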

Finally, one might also like Bayesian reasoning as a strategy that has a certain friendliness to Darwin's problem. In decision theory, the Bayesian decision rule (making the minimum expected error decision) is an optimal strategy. To the extent that some other decision rule (or non-Bayesian inference procedure) would come to exist, it would be out-competed by Bayesians, all other things being equal.
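
As a rough sketch of what such a decision rule computes: given a posterior over states and a loss for each action in each state, pick the action with the smallest posterior expected loss. The posterior, the action names and the loss tables below are all made up for illustration, and `bayes_action` is a hypothetical name.

```python
def bayes_action(posterior, loss):
    """Return the action that minimizes posterior expected loss."""
    def expected_loss(action):
        return sum(posterior[s] * loss[action][s] for s in posterior)
    return min(loss, key=expected_loss)

# Illustrative posterior over two states of the world.
posterior = {"h1": 0.7, "h2": 0.3}

# 0-1 loss: guessing the true state costs 0, guessing wrong costs 1.
zero_one = {
    "guess_h1": {"h1": 0, "h2": 1},
    "guess_h2": {"h1": 1, "h2": 0},
}

# Asymmetric loss: wrongly guessing h1 when h2 is true is very costly.
asymmetric = {
    "guess_h1": {"h1": 0, "h2": 10},
    "guess_h2": {"h1": 1, "h2": 0},
}

print(bayes_action(posterior, zero_one))    # guess_h1
print(bayes_action(posterior, asymmetric))  # guess_h2
```

Under 0-1 loss the rule simply picks the most probable hypothesis; with other losses it need not.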

Alex Clark, August 7, 2013 at 5:08 AM
I think there is a more general objection, which is that these models are "learning" models rather than "acquisition" models. That is to say, Bayesian models use information rationally to select the appropriate grammar, or update the posterior probability, rather than having the grammar be triggered in some sub-rational brute causal way. Some people (Chomsky? Norbert? Paul Pietroski?... I don't know) have a very strong conviction that language is not "learned" in this sense, but rather "grows" in some other way, in a process analogous to the biological development of an organ etc etc. I guess you know the rhetoric.


Avery Andrews, August 11, 2013 at 10:14 AM
In spite of Norbert's qualms, I think generativists ought to be enthusiastic about Bayes, because it means that there has to be a prior, which on current knowledge will either have to take the form of a finite collection of parameters or some kind of rule/constraint notation with something like an evaluation metric defined over it, answering the perennial question of typological-functionalists and indolent undergraduates: what's the point of putting the grammars into a definite formal framework?

The fact that some Bayesians are pursuing very lean hypotheses about the content of UG should not be a worry, because if they're inadequate, this will become demonstrable in due course (I don't think it is yet).