Tuesday, April 5, 2016

Yang on Bayes 1

This is the first of two posts on a recent paper by Charles Yang. Once again, the topic got away from me. I break it down into two parts to prevent you running away too scared to even look. I suspect that my clever maneuver won’t help. But I try.

One more caveat: this is my understanding of Charles’ paper. He should only be held responsible for what I say because he allowed me to read it, and that might be a culpable act.

Charles Yang has a new paper (CY) forthcoming in Language Acquisition and it is a must read. Aside from sketching a very powerful critique of contemporary Bayesianism as applied to linguistic problems (I will return to this critique momentarily), it also does something that I never thought I would witness in my lifetime: it makes quantitative predictions in a linguistic domain. And by “quantitative” I do not mean giving p-values or confidence intervals. I mean numerical predictions about the size of measurable effects. That’s quantitative! So, run, don’t walk, to the above link and read the damn thing!

Ok, now that I’ve discharged my kudosing responsibilities, I want to discuss a second feature of CY. It offers an excellent critique of current Bayesian practice as applied to linguistic issues. As you all know, Bayes is a big player nowadays. In fact, I am tempted to say that Bayes has stepped into the position that Connectionism once held as the default theoretical framework in the cognitive sciences in general and the cognition of language in particular. As you may also recall, I have expressed reservations about Bayes and what it brings to the linguistics table (see here, here, here, here). However, it was not until I read CY that I clearly understood what bugs me about the Bayes framework as applied to my little domain of interests. I want to “share” this aha moment with you.

The conclusion can be put briskly as follows: Bayes is not so much wrong as wrong-headed. From a Generative Grammar (GG) perspective, it’s not the details that are off the mark (though they can be) but the idealization implicit in the framework that is (i.e. if you accept as I do that GG has correctly identified the “computational” problems (see here) then Bayes is of little relevance and maybe worse). And because this is so, there is very little we can learn from Bayesian modelings of linguistic problems. There is both (i) not enough there there and (ii) what there there is points in the wrong direction.

Let me put this point another way: all theories idealize. Empirically successful theories are built on good idealizations.[1] CY’s argument is that Bayes is a bad idealization when applied to matters linguistic. If it is right (and, surprise surprise, I believe it is) then Bayes not only adds little, it positively misleads and misdirects. Why? Because bad idealizations cannot be empirically redeemed. It’s ok to be wrong (you can build on error given a good framing of the problem). An adequate conception of the problem (which is what idealizations embody) tolerates empirical missteps. But if a theory is wrongheaded it will impede progress. You can’t data your way out of a misconceived idealization. Why not? Because if you’ve got the problem wrong then data coverage will come with a big price: ad hoc assumptions whose main purpose is to make up for the basic misconception. Hence, a misframing of the problem leads to explanatory sterility and misplaced efforts. That’s the claim. Let’s see how Bayes misidealizes when considered from a GG perspective.

A Bayes model consists of 3 moving parts: (i) a hypothesis space delimiting the range of options (possibly weighted), (ii) a specification of the data relevant to the choosing of the right hypothesis, (iii) an update rule saying how to evaluate a given hypothesis given certain evidence. This 3-step procedure can iterate so that data can be evaluated incrementally.[2] What makes a Bayes account distinctive is not this 3-step process. After all, what do (i-iii) say beyond the uncontroversial truism that data is relevant to hypothesis acceptance? No, what makes the Bayes picture distinctive is how it conceives of the hypothesis space, the relevant data and the update rule. This is what gives Bayes content and this, CY argues, is where Bayes misfires. It misidentifies the size of the hypothesis space, the amount of relevant data and the nature of the update rule. How exactly? As follows:

a.     Bayes misidentifies the size of the hypothesis space. In particular, Bayes takes as the default assumption that the hypothesis space is large. This makes the basic cognitive problem one of picking out the right hypothesis from a large number of (wrong) possibilities. This is a mis-idealization if within linguistic domains (e.g. acquisition) only a (very) small number of candidates are ever being evaluated wrt the input at any one time.  If the candidate set is small, then the “hard” problem is providing the right characterization of the restricted hypothesis space, not figuring out how to find the right theory in a large space of alternatives.[3]

b.     Bayes mischaracterizes the nature of the data exploited. In particular, Bayes idealizes the learning problem as trying to figure out how to use lots of complex information to find the right hypothesis in a large space. However, CY notes, G acquisition proceeds by using only a small part of the "relevant" data at any given time and there is not much of it. More specifically, the PLD that the child uses is severely restricted (sparse, degenerate and inadequate). Thus, in contrast to the full range of linguistic data that linguists use to find a native speaker’s G, kids only use a severely restricted data set when zeroing in on their Gs. Thus, for the child, the problem is not finding a G in a large haystack of Gs using a very big pitchfork (that’s the linguist’s problem), but is more like skewering one G from among a small number of scattered Gs using a fragile toothpick (the PLD being multiply deficient). If this is correct, then the hard problem is again finding the structure of the G space so that pretty “weak” evidence (roughly sparsely scattered examples of main clause phenomena)[4] suffices to fix the right G for the LAD.

c.     Bayes uses all the data to evaluate all the hypotheses at every iterative step. The way that Bayes models work is that every hypothesis in the space is evaluated wrt all of the data at any given point (i.e. cross situational learning). So, add new data and we update every hypothesis wrt that data. We might describe this as follows: the procedure takes a panoramic view of the updating function. Contrast this to a procedure where only one theory (or two) is ever seriously being considered at any one step and alternatives are considered only when the favored one(s) fail (see here). On this second view, evaluation of hypotheses is severely myopic, with only a very small number of alternatives ever being considered. Moreover, the myopia might be yet more severe: the alternatives are not chosen from among the most highly valued alternatives but randomly. So, if H1 fails then the procedure does not opt for H2 because it is the next best theory to that point; the rule is to just pick another hypothesis at random. So, not only is the procedure myopic, it is pretty dumb. Blind and dumb and thus not very rational.

d.     Bayes misunderstands the decision rule to be an optimizing function. A symptom of this misunderstanding is the problem Bayes has in accounting for the widespread phenomenon of probability matching (PM). Indeed, Bayes doesn’t explain PM. It explains it away. Why? Because Bayes alone cannot explain it. Bayesian inference is inconsistent with PM. Left to its own devices, Bayes implies that agents will select the option that maximizes the posterior rather than split the difference probabilistically between many different options. But the latter is what PM does (PM implies splitting the difference in accord with the probability of the options). To deal with this (and PM is pervasive), Bayes adds further assumptions (some might argue ad hoc assumptions (see below)) to allow the maximizing rule to result in probability matching. If so, this suggests that the Bayesian idealization without further supplementation points one in the wrong direction (see here for discussion). Much preferred would be an update rule that allows PM. Such rules exist and CY discusses them.
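For those who like to see such contrasts in code, here is a minimal Python sketch of the distinctions drawn in (c) and (d). To be clear, this is my toy illustration and not CY's model: the function names, the three hypotheses H1-H3 and the likelihood numbers are all made up for the occasion. It shows a "panoramic" Bayesian update that rescores every hypothesis on each datum, a maximizing decision rule next to a probability matching one, and a myopic learner that keeps its current hypothesis until it fails and then picks another at random.

```python
import random


def bayes_update(prior, likelihoods):
    """Panoramic update: rescore EVERY hypothesis against the new datum.

    prior and likelihoods are dicts keyed by hypothesis name (hypothetical).
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def maximize(posterior):
    """Optimizing decision rule: always pick the highest-posterior hypothesis."""
    return max(posterior, key=posterior.get)


def probability_match(posterior, rng):
    """Probability matching: pick each hypothesis in proportion to its probability."""
    hypotheses = list(posterior)
    weights = [posterior[h] for h in hypotheses]
    return rng.choices(hypotheses, weights=weights, k=1)[0]


def myopic_step(current, datum_fits, hypotheses, rng):
    """Myopic, 'dumb' learner: keep the current hypothesis while it handles
    the datum; when it fails, switch to a different one chosen at random
    (not to the next-best one)."""
    if datum_fits(current):
        return current
    others = [h for h in hypotheses if h != current]
    return rng.choice(others)


if __name__ == "__main__":
    rng = random.Random(0)
    prior = {"H1": 1 / 3, "H2": 1 / 3, "H3": 1 / 3}   # assumed flat prior
    likelihoods = {"H1": 0.6, "H2": 0.3, "H3": 0.1}   # assumed, for one datum
    posterior = bayes_update(prior, likelihoods)

    # The maximizer gives the same answer on every trial.
    print("maximizer picks:", maximize(posterior))

    # The matcher spreads its choices roughly in line with the posterior.
    picks = [probability_match(posterior, rng) for _ in range(10_000)]
    for h in posterior:
        print(h, "chosen with frequency", picks.count(h) / len(picks))
```

The point of putting the two decision rules side by side is that the maximizer returns the same hypothesis every time, whereas the matcher's choice frequencies track the probabilities. Getting the former to behave like the latter requires adding something to it, which is the worry in (d).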

It is worth noting that Bayes makes these substantive assumptions for principled reasons. Bayes started out as a normative theory of rationality. Bayes was developed as a formalization of the notion “inference to the best explanation.” In this context the big hypothesis space, full data, full updating, optimizing rule structure noted above are reasonable, for they undergird the following principle of rationality: choose that theory which among all possible theories best matches the full range of possible data. This makes sense as a normative principle of inductive rationality. The only caveat is that Bayes seems to be too demanding for humans. There are considerable computational costs to Bayesian theories (again CY notes these) and it is unclear how good a principle of rationality is if humans cannot apply it. However, whatever its virtues as a normative theory, it is of dubious value in what L.J. Savage (one of the founders of modern Bayesianism) termed “small world” situations (SWS).

[1] For example, the view that there is an infinite number of natural language objects is an idealization. It is certainly false that humans can manage sentences with, say, 40 levels of embedding. So there is some upper bound on the number of sentences native speakers can effectively manage. Thus we have no behavioral evidence that human linguistic capacity is in fact infinite in the required sense. However, this is not really relevant to the utility of the idealization. What matters is whether it makes the central problem vivid (and tractable) and it does. The basic fact about human linguistic competence is that we can use and understand sentences never before encountered. The question is how we extend beyond what we have been exposed to in order to do this (i.e. the projection problem). It matters not a whit if the domain of our competence extends to an infinite set of objects or to just a very very very large one. Or even a small one for that matter. What matters is that we project beyond what we have heard to sentences original to us. If we can do this, we need rules. The infinity assumption vividly highlights the need for rules or Gs in any account of linguistic competence. Thus, in that sense, it is a fine idealization even if, perhaps, false.

I should add that I am not sure that it is false in the relevant sense. After all, humans have numerical competence even though I doubt that we do well with integers 100,000,000 digits long. My point is that even if false, the idealization is an excellent one, for it highlights the problem that needs solving, and whatever solution we come up with under this assumption will extend naturally to the more realistic one. Solving the projection problem in the “larger” domain will serve to solve it in the smaller.
[2] Note that a procedure being iterative does not mean that it is incremental in the interesting sense that we want from our acquisition models. Incremental in the latter sense means that the more data we get, the closer we get to the true theory. It means that as we get more data we don’t bounce around the hypothesis space from G to G. Rather, as we get more data we smoothly home in on the right G. It is possible for a theory to be iterative without being incremental. Dresher and Kaye have good discussions of this and of the (idealized) assumption that there is instantaneous learning within parameter setting models. The latter idealization makes sense if in fact there is no smooth relation between more data and closing in on the right G (though this is not its only virtue). As we would like our theories to be incremental, figuring out how to make this so in, for example, a parameter setting model is a very interesting question (and the one that Dresher and Kaye focus on).
[3] Moreover, finding the right answer in a restricted space need not be identical to finding the right answer in a large space. Finding your keys on your desk (notice I said “your” not “my”) is not obviously the same problem as finding a needle in a haystack.
[4] CY has an excellent discussion of just how sparse things can be. Standard PoS arguments note how deficient the PLD is relative to much of the knowledge attained.
