Faculty of Language: Pouring gasoline on the flames: Yang on Bayes 3

Monday, April 11, 2016

Pouring gasoline on the flames: Yang on Bayes 3

I want to pour some oil on the flames. Which flames? The ones that I had hoped that my two recent posts on Yang’s critique of Bayes (here and here) would engender. There has been some mild pushback (from Ewan, Tal, Alex and Avery). But the comments section has been pretty quiet. I want to restate what I take to be the heart of the critique because, if correct, it is very important. If correct, it suggests that there is nothing worth salvaging from the Bayes “revolution” for there is no there there. Let me repeat this. If Yang is right, then Bayes is a dead end with no redeeming scientific (as opposed to orthographic) value. This does not mean that specific Bayes proposals are worthless. They may not be. What it means is that Bayes per se not only adds nothing to the discussion, but that taking its tenets to heart will mislead inquiry. How so? It endorses the wrong idealization of how stats are relevant to cognition. And misidealizations are as big a mistake as one can make, scientifically speaking. Here’s the bare bones of the argument.

1. Everyone agrees that data matters for hypothesis choice

2. Everyone agrees that stats matter in making this choice

3. Bayes makes 4 specific claims about how stats matter for hypothesis choice:

a. The hypothesis space is cast very wide. In the limit all possible hypothesis are in the space of options

b. All potentially relevant data is considered, i.e. any data that could decide between competing hypotheses is used to adjudicate among the hypotheses in the space

c. All hypotheses are evaluated wrt to all of the data. So, as data is considered every hypothesis’ chance of being true is evaluated wrt to every data point considered

d. When all data has been considered the rule is to choose that hypothesis in the space with the highest score

Two things are worth noting about the above.

First, that (3) provides serious content to a Bayesian theory, unlike (1) and (2). The latter are trivial in that nobody has ever thought otherwise. Nobody. Ever. So if this is the point of Bayes, then this ain’t no revolution!

Second, (3) has serious normative motivation. It is a good analysis of what kind of inference an inference to the best explanation might be. Normatively, an explanation is best if it is better than all other possible explanations and accounts for all of the possibly relevant data. Ideally, this implies evaluating all alternatives wrt to all the data and choosing the best. This gives us (3a-d). Cognitive Bayes (CB) is the hypothesis that normative Bayes (NB) is a reasonable idealization for people actually do when the learn/acquire something. And we should appreciate that this could be the case. Let’s consider how for a moment.

The idealization would make sense for the following kind of case (let’s restrict ourselves to language). Say that the hypothesis space of a potential Gs was quite big. For concreteness, say that we were always considering about 50 different candidate Gs. This is not all possible Gs, but 50 is a pretty big number computationally speaking. So say 50 or more alternatives is the norm. Then Bayes (3a) would function a lot like the standard linguistic assumption that the set of well-formed syntactic objects in a given language is effectively infinite. Let me unpack the analogy.

This infinity assumption need not be accurate to be a good idealization. Say it turns out that the number of well-formed sentences a native speaker of English is competent wrt is “only” 10¹⁰⁰⁰. Wouldn’t this invalidate the infinity assumption? No, it would show that it is false, but not that it is a bad idealization. Why? Because the idealization is a good one because it focuses attention onto the right problem. Which one? The Projection Problem: how do native speakers go from a part of the language all of it? How given exposure to only a subset of the language does a LAD get mastery over a whole language? The answer: you acquire recursive rules, a G, that’s how. And this is true whether or not the “language” is infinite or just very big. The problem, going from a subset to its containing superset, will transit via a specification of rules whether or not the set is actually infinite. All the infinite idealization does is concentrate the mind on the projection problem by making the alternative tempting idea (learning by listing) silly. This is what Chomsky means when he says in Current Issues” “once we have mastered a language, the class of sentences with which we can operate fluently or hesitation is so vast that for all practical purposes (and, obviously, for all theoretical purposes), we may regard it as infinite” (7, my emphasis NH). See: the idealization is reasonable because it does not materially change the problem to be solved (i.e. how to go from part of the language you are exposed to, to the whole language that you have mastery over).

A similar claim could be true of Bayes. Yes, the domain of Gs a LAD considers is in fact big. Maybe not thousands or millions of alternatives, but big enough to be worth idealizing to a big hypothesis space in the same way that it is worth assuming that the class sentences a native speaker is competent wrt is infinite. Is this so? Probably not. Why not? Because even moderately large hypothesis spaces (say with over 5 competing alternatives) turns out to be very hard to manage. So the standard practice is to use really truncated spaces, really small SWSs. But when you so radically truncate the space, there is no reason to think that the inductive problem remains the same. Just think if the number of sentences we actually knew was about 5 (roughly what happens in animal communication systems). Would the necessity of rules really be obvious? Might we not reject the idealization Chomky argues for (and note that I emphasize ‘argue’)? So, rejecting (3a) means rejecting part of the Bayes idealization.

What of the other parts, (3b-d)? Well, as I noted in my posts, Charles argues that each and every one is wrong in such a way as to be not worth making. It gets the shape of the problem wrong. He may be right. He may be wrong (not really, IMO), but he makes an argument. And if he is right, then what’s at stake is the utility of RB as a useful idealization for cognitive purposes. And, if you accept this, we are left with (1-2), which is methodological pablum.

I noted one other thing the normative idealization above was once considered as a cognitive option within linguistics. It was knows as the child-as-little-linguist theory. And it had exactly the same problems that Bayes has. It suggests that what kids do is what linguists do. But it is not the same thing at all. And realizing this helped focus on what the problem the LAD faces is. Bayes is not unique in misidealizing a problem.

Three more points and I end today’s diatribe.

First, one can pick and choose among the four features above. In other words, there is no law saying that one must choose the various assumptions as a package. One can adopt a SWS assumption (rejecting 3a) while adopting a panoramic view of the updating function (assuming that every hypothesis in the space is updated wrt every new data point) and rejecting choice optimization (3d). In other words, mixing and matching is fine and worth exploring. But what gives Bayes content, and makes it more than one of many bookkeeping notations, is the idealization implicit in CB as NB.

Second, what makes Bayes scientifically interesting is the idealization implicit in it. I mention this because as Tal notes in a comment (here), it seems that current Bayesians are promoting their views as just “set of modeling practices.” The ‘just’ is mine, but this seems to me what Tal is indicating about the paper he links to. But the “just” matters. Modeling practices are scientifically interesting to the degree that they embody ideas about the problem being modeled. The good ones are ones that embody a good idealization. So, either these practices are based on substantive assumptions or they are “mere” practices. If the latter, then the Bayes modeling is in itself of zero scientific interest. Does anyone really want to defend Bayes in this way? I confess that if this is the intent then there is nothing much to argue about given how modest (how really modest, how really really modest) the Bayes claim is.

Last, there is a tendency to insulate one’s work from criticism. One way of doing this is to refuse to defend the idealizations implicit in one’s technology. But technology is never innocent. It always embodies assumptions about the way the world is so that the technology used is a good technology in that it allows one to see/do things that other technologies do not permit or, at least, does not distort how the basic problems of interest are to be investigated. But researchers hate having to defend their technology, more often favoring the view that how it runs is its own defense. I have been arguing that this is incorrect. It does matter. So, if it turns out that Bayesians now are urging us to use the technology but are backing away from the idealizations implicit in it, that is good to know. This was not how it was initially sold. It was sold as a good way of developing level 1 cognitive theories. But if Bayes has no content then this is false. It cannot be the source of level 1 theories for on the revised version of Bayes as a “set of modeling practices” Bayes per se has no content so Bayes is not and cannot be a level 1 theory of anything. It is vacuous. Good to know. I would be happy if this is now widely conceded by our most eminent Bayesians. If this is now the current view of things, then there is nothing to argue about. If only Bayes had told us this sooner.

26 comments:

Noah MotionApril 11, 2016 at 12:51 PM
I'll just reiterate and elaborate on the point I made in the previous thread that in order for a model to be Bayesian, it just needs to use Bayes' rule. Nothing about Bayes' rule requires (a) a large hypothesis space or (b) use of all the relevant data. Bayes' rule can be applied to a small hypothesis space, and it can be applied incrementally.

Given a hypothesis space and some data, it will, per (c), be applied to all the available hypotheses. And, per (d), the decision rule is typically to pick the hypothesis with the highest posterior probability.

Of course, it is correct to point out both that (a) and (b) are typically employed along with (c) and (d) in Bayesian models of cognition, and that these are substantive claims about cognition. But neither is necessary or sufficient for a model to be Bayesian.
ReplyDelete
Replies
UnknownApril 11, 2016 at 3:59 PM
All this stuff about the size of the hypothesis space and how update works seems like cruft hanging on the theoretical distinction of interest, which is why I think people in the last post were balking at (3). The tenets in (3) are correlated with that distinction, but they don't constitute it.

For a (subjective) Bayesian, probabilities are (representative of) beliefs. This gets ported over to cognitive theorizing in the form of the following claim: probabilities are first class representational objects that learners manipulate over the course of learning. For the postulation of these objects to have any predictive power (at the level of the cognitive theory), they need to have consequences for, e.g., how learners carry forward uncertainty and ultimately select a hypothesis.

One way learners might carry forward uncertainty is by (i) tagging grammars with probabilities, subject to (a) some algebraic structure the space of grammars has and (b) some way of measuring that structure that satisfies certain rules (e.g., the Kolmogorov or De Finnetti axioms) and then (ii) updating those probabilities as data come in, again subject to those axioms and possibly some auxiliary assumptions. They might also carry forward uncertainty in ways that technically violate those axioms but within some approximation bounds. (More on approximation below.)

So, one way the picture above could be wrong is that learners don't carry forward uncertainty at all. For instance, they might just hop between states in the hypothesis space, guided by some algorithm that may or may not have theoretical guarantees wrt choosing the "optimal" grammar (whatever that means). And that algorithm may well exhibit properties in aggregate that do just what a learner that actually represented the probabilities do (or it might not). But that doesn't mean that the learner represents and manipulates the probabilities in any meaningful sense; it's just a fact about the structure of the algorithm.

I may be reading you and Yang wrong, Norbert, but for your particular argument, can't you allow back-off to the approximation claim (within reason)? Because what's important is the claim about probabilities as first class cognitive objects. And if someone with a level 1 Bayesian theory doesn't want to make *that* claim, I think I agree that there's no value added from the theory being Bayesian (whatever it means for a theory to be Bayesian in that case). But I disagree that it's problematic as a matter of methodology, since either way, you need tools for understanding how a particular algorithm relates to a particular algebraic structure and probability measure on that structure (presupposing the truism that "stats matter in making this choice"). Just like in syntactic theory, different notations may make it easier or harder to state particular generalizations, but in the end, one can do uninteresting work in any notation, since the notation isn't what's important.
ReplyDelete
Replies
UnknownApril 12, 2016 at 12:29 AM
I'm still not convinced that there's any particular cognitive theory that is committed to 3a-d. I'm especially surprised by the last point (When all data has been considered the rule is to choose that hypothesis in the space with the highest score) because it seems to me that the "most Bayesian" thing to do is to never choose any particular hypothesis, and rather marginalize over all of the hypotheses based on their posterior probability. It might be more productive to discuss one specific published proposal.
ReplyDelete
Replies
JonApril 12, 2016 at 3:47 AM
I think the distinction between normative and cognitive models is very important. I deal in pragmatics, where there is room for both types of models. Most "pure" models of neo-Gricean reasoning (like game-theoretic models) are normative in the sense that we expect people in the real world to reason imperfectly and deviate from what the model predicts is optimal. For example, implicatures may be predicted by a normative model which are in reality difficult to get, because people bring their own biased reasoning mechanisms to the table, and these often undercut Gricean reasoning. But nonetheless, there is room for normative models in pragmatics, because people do tend to follow conversational conventions that either match or approximate what is normatively optimal. For those cases, Bayesian models are appropriate.

But I have been somewhat critical (here) of the application of Bayesian cognitive modeling to pragmatic questions, though perhaps I'm more critical of its application in practice than in theory. That is to say, maybe it's possible to do it right, but it seems to me that often much of what's interesting gets dumped unceremoniously into the prior without further investigation. This creates the illusion that you've explained something when really you've just described it.

But this view---that Bayes is useful for normative models and not so useful for cognitive models---spells trouble for the application of Bayesianism to the rest of linguistics, as Charles does a good job pointing out. Thinking especially of the problem of syntax and language learning: What would a "normatively optimal grammar" look like? It's at worst an incoherent notion, and at best irrelevant to actually doing linguistics. The only way the learner can be seen to be "optimizing" her grammar is if a whole host of complex constraints are built into the prior in order to capture the fact that (i) the learner doesn't always use all of the data available to her, and (ii) the learner often makes inferences far beyond what the data has justified (i.e. poverty of stimulus). And if you were to do that, you'd want to investigate where these constraints on the prior come from (otherwise you haven't explained much). And that investigation, to me, is exactly what generative linguistics is in the first place. Is it possible to capture island constraints with Bayes' rule? Surely it is... but what's the use?
ReplyDelete
Replies
AveryAndrewsApril 14, 2016 at 4:43 PM
Backing off a bit from Bayesian technicalities, one important point in Charles' paper is the demonstration that there is a quantitative but not statistical (at least, not immediately statistical) factor in language acquisition, the Tolerance Principle.

On the other hand, the category of a-adjectives does seem like a special sub-part-of-speech of English, and I'm not getting a clear idea of what drives its acquisition, so is Bayes really dead here? Note for example that the a-adejctives have various '(often partially) functionally adequate substitutes' (I wanted to call them paronyms or paranyms, but both of these terms are already taken. So we can say 'a burning house' instead of 'an aflame house', or 'a noisy/babbling/crying/complaining/lively/... child' instead of 'an awake child' (when we wish to point out that a child is awake, it is usually because it is satisfying one or many of the above descriptors). So perhaps a grammar that recognizes this class and imposes the restriction is overall Bayesianly better than one that doesn't (but focussing on successful prediction of the next word, as it is my impression that Michael Ramscar and his group does, might be another way of thinking about it; I don't know if they amount to the same thing or not in the end).

& even if, technically, there is no there there for Bayes, it is I think still a reasonable basis for the thought that if you are an ordinary generative grammarian trying to 'tune' the formalism so that it makes it easier to describe stuff that does happen a lot, and harder or impossible to describe stuff that doesn't happen much or at all, your activities might well contribute to an explanation of why languages are learnable, and how.
ReplyDelete
Replies

Add comment