This is the first of two posts on a recent paper by Charles Yang. Once again, the topic got away from me. I break it down into two parts to prevent you running away too scared to even look. I suspect that my clever maneuver won’t help. But I try.

One more caveat: this is *my* understanding of Charles’ paper. He should only be held responsible for what I say because he allowed me to read it, and that might be a culpable act.
Charles Yang has a new paper (CY) forthcoming in *Language Acquisition* and it is a must read. Aside from sketching a very powerful critique of contemporary Bayesianism as applied to linguistic problems (I will return to this critique momentarily), it also does something that I never thought I would witness in my lifetime: it makes *quantitative* predictions in a linguistic domain. And by “quantitative” I do *not* mean giving p-values or confidence intervals. I mean numerical predictions about the size of a measurable effect. That’s quantitative! So, run, don’t walk to the above link and read the damn thing!
Ok,
now that I’ve discharged my kudosing responsibilities, I want to discuss a
second feature of CY. It offers an excellent critique of current Bayesian
practice as applied to linguistic issues. As you all know, Bayes is a big
player nowadays. In fact, I am tempted to say that Bayes has stepped into the
position that Connectionism once held as the default theoretical framework in
the cognitive sciences in general and the cognition of language in particular. As
you may also recall, I have expressed reservations about Bayes and what it
brings to the linguistics table (see here, here, here, here). However, it was not until I read CY that I clearly understood
what bugs me about the Bayes framework as applied to my little domain of
interests. I want to “share” this aha moment with you.

The conclusion can be put briskly as follows: Bayes is not so much wrong as wrong-headed. From a Generative Grammar (GG) perspective, it’s not the details that are off the mark (though they can be) but the *idealization* implicit in the framework that is (i.e. if you accept, as I do, that GG has correctly identified the “computational” problems (see here), then Bayes is of little relevance and maybe worse). And because this is so, there is very little we can learn from Bayesian modelings of linguistic problems. There is both (i) not enough there there and (ii) what there there is points in the wrong direction.
Let me put this point another way: all theories idealize. Empirically successful theories are built on good idealizations.[1] CY’s argument is that Bayes is a bad idealization when applied to matters linguistic. If it is right (and, surprise surprise, I believe it is) then Bayes not only adds little, it positively misleads and misdirects. Why? Because bad idealizations cannot be empirically redeemed. It’s ok to be wrong: you can build on error given a good framing of the problem, and an adequate conception of the problem (which is what idealizations embody) tolerates empirical missteps. But if a theory is wrongheaded it will impede progress. You can’t data your way out of a misconceived idealization. Why not? Because if you’ve got the problem wrong then data coverage will come with a big price: ad hoc assumptions whose main purpose is to make up for the basic misconception. Hence, a misframing of the problem leads to explanatory sterility and misplaced efforts. That’s the claim. Let’s see *how* Bayes misidealizes when considered from a GG perspective.
A Bayes model consists of 3 moving parts: (i) a hypothesis space delimiting the range of options (possibly weighted), (ii) a specification of the data relevant to choosing the right hypothesis, and (iii) an update rule saying how to evaluate a given hypothesis given certain evidence. This 3-step procedure can iterate so that data can be evaluated incrementally.[2] What makes a Bayes account distinctive is *not* this 3-step process. After all, what do (i-iii) say beyond the uncontroversial truism that data is relevant to hypothesis acceptance? No, what makes the Bayes picture distinctive is *how* it conceives of the hypothesis space, the relevant data and the update rule. This is what gives Bayes content, and this, CY argues, is where Bayes misfires. It misidentifies the *size* of the hypothesis space, the *amount* of relevant data and the *nature* of the update rule. How exactly? As follows:
a. *Bayes misidentifies the size of the hypothesis space*. In particular, Bayes takes as the default assumption that the hypothesis space is large. This makes the basic cognitive problem one of picking out the right hypothesis from a large number of (wrong) possibilities. This is a mis-idealization if within linguistic domains (e.g. acquisition) only a (very) small number of candidates are ever being evaluated wrt the input at any one time. If the candidate set is small, then the “hard” problem is providing the right characterization of the restricted hypothesis space, not figuring out how to find the right theory in a large space of alternatives.[3]
b. *Bayes mischaracterizes the nature of the data exploited*. In particular, Bayes idealizes the learning problem as trying to figure out how to use lots of complex information to find the right hypothesis in a large space. However, CY notes, G acquisition proceeds by using only a small part of the "relevant" data at any given time and there is not much of it. More specifically, the PLD that the child uses is severely restricted (sparse, degenerate and inadequate). Thus, in contrast to the full range of linguistic data that linguists use to find a native speaker’s G, kids use only a severely restricted data set when zeroing in on their Gs. Thus, for the child, the problem is not finding a G in a large haystack of Gs using a very big pitchfork (that’s the linguist’s problem), but is more like skewering one G from among a small number of scattered Gs using a fragile toothpick (the PLD being multiply deficient). If this is correct, then the hard problem is again finding the structure of the G space so that pretty “weak” evidence (roughly sparsely scattered examples of main clause phenomena)[4] suffices to fix the right G for the LAD.
c. *Bayes uses all the data to evaluate all the hypotheses at every iterative step*. The way that Bayes models work is that every hypothesis in the space is evaluated wrt all of the data at any given point (i.e. cross situational learning). So, add new data and we update every hypothesis wrt that data. We might describe this as follows: the procedure takes a *panoramic* view of the updating function. Contrast this to a procedure where only one theory (or two) is ever seriously being considered at any one step, and alternatives are considered only when the favored one(s) fails (see here). On this second view, evaluation of hypotheses is severely *myopic*, with only a very few alternatives ever being considered. Moreover, the myopia might be yet more severe: the alternatives are not chosen from among the most highly valued alternatives but randomly. So, if H1 fails then the procedure does not opt for H2 because it is the next best theory to that point; rather, the rule is to just pick another hypothesis at random. So, not only is the procedure myopic, it is pretty dumb. Blind and dumb and thus not very rational.
d. *Bayes misunderstands the decision rule to be an optimizing function*. A symptom of this misunderstanding is the problem Bayes has in accounting for the widespread phenomenon of probability matching (PM). Indeed, Bayes doesn’t explain PM. It explains it away. Why? Because Bayes *alone* cannot explain it. Bayesian inference is inconsistent with PM. Left to its own devices, Bayes implies that agents will select the option that *maximizes* the posterior rather than split the difference probabilistically between many different options. But the latter is what PM does (PM implies splitting the difference in accord with the probability of the options). To deal with this (and PM is pervasive), Bayes adds further assumptions (some might argue *ad hoc* assumptions (see below)) to allow the maximizing rule to result in probability matching. If so, this suggests that the Bayesian idealization without further supplementation points one in the wrong direction (see here for discussion). Much preferred would be an update rule that allows PM. Such rules exist and CY discusses them.
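To make the contrast in (c) and (d) concrete, here is a minimal sketch. It is my own toy construction, not CY’s model or any particular Bayesian proposal: a tiny hypothesis space of coin biases, a panoramic update that rescales *every* hypothesis on each datum, and the two decision rules, maximizing versus probability matching, applied to the resulting posterior.

```python
# Toy illustration (my own example, not from CY): three hypotheses about
# a coin's bias for heads, updated panoramically, then two decision rules.

from collections import Counter
import random

# (i) The hypothesis space, with a flat prior over three candidate biases.
hypotheses = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

def update(posterior, outcome):
    """One panoramic step: rescale EVERY hypothesis by its likelihood
    for the observed outcome ('H' or 'T'), then renormalize."""
    scored = {h: p * (h if outcome == "H" else 1 - h)
              for h, p in posterior.items()}
    total = sum(scored.values())
    return {h: p / total for h, p in scored.items()}

# (ii) Some data; (iii) iterate the update rule over it.
data = ["H", "H", "T", "H", "H", "H", "T", "H"]
posterior = dict(hypotheses)
for outcome in data:
    posterior = update(posterior, outcome)

# Decision rule 1: maximize -- always pick the highest-posterior hypothesis.
best = max(posterior, key=posterior.get)

# Decision rule 2: probability matching -- choose each hypothesis with a
# frequency proportional to its posterior probability.
random.seed(0)
matched = Counter(random.choices(list(posterior),
                                 weights=list(posterior.values()), k=1000))

print(best)     # -> 0.8: the maximizer returns the same answer every time
print(matched)  # choices spread across hypotheses in rough posterior proportion
```

The point of the contrast: a bare maximizing rule deterministically returns the single best hypothesis, while PM behavior looks like the matcher’s spread of choices, which is why Bayes needs extra assumptions (e.g. a sampling step like the one simulated here) to produce it.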
It is worth noting that Bayes makes these substantive assumptions for *principled* reasons. Bayes started out as a normative theory of rationality. It was developed as a formalization of the notion “inference to the best explanation.” In this context the big hypothesis space, full data, full updating and optimizing rule noted above are *reasonable*, for they undergird the following principle of rationality: choose the theory which among all possible theories best matches the full range of possible data. This makes sense as a normative principle of inductive rationality. The only caveat is that Bayes seems to be too demanding for humans. There are considerable computational costs to Bayesian theories (again, CY notes these) and it is unclear how good a principle of rationality is if humans cannot apply it. However, whatever its virtues as a normative theory, it is of dubious value in what L.J. Savage (one of the founders of modern Bayesianism) termed “small world” situations (SWS).
[1] For example, the view that there is an infinite number of natural language objects is an idealization. It is certainly false that humans can manage sentences with, say, 40 levels of embedding. So there is some upper bound on the number of sentences native speakers can effectively manage. Thus we have *no* behavioral evidence that human linguistic capacity is in fact infinite in the required sense. However, this is not really relevant to the utility of the idealization. What matters is whether it makes the central problem vivid (and tractable), and it does. The basic fact about human linguistic competence is that we can use and understand sentences never before encountered. The question is how we extend beyond what we have been exposed to (i.e. the projection problem). It matters not a whit if the domain of our competence extends to an infinite set of objects or to just a very very very large one. Or even a small one for that matter. What matters is that we project beyond what we have heard to sentences original to us. If we can do this, we need rules. The infinity assumption vividly highlights the need for rules or Gs in any account of linguistic competence. Thus, in that sense, it is a fine idealization even if, perhaps, false.
I should add that I am not sure that it is false in the relevant sense. After all, humans have numerical competence even though I doubt that we do well with integers 100,000,000 digits long. My point is that *even if* false, the idealization is an excellent one, for it highlights the problem that needs solving, and whatever solution we come up with under this assumption will extend naturally to the more realistic one. Solving the projection problem in the “larger” domain will serve to solve it in the smaller.
[2] Note that a procedure’s being iterative does not mean that it is incremental in the interesting sense that we want from our acquisition models. Incremental in the latter sense means that the more data we get, the closer we get to the true theory. It means that as we get more data we don’t bounce around the hypothesis space from G to G. Rather, as we get more data we smoothly hone in on the right G. It is possible for a theory to be iterative without being incremental. Dresher and Kaye have good discussions of this and of the (idealized) assumption that there is instantaneous learning within parameter setting models. The latter idealization makes sense if in fact there is no smooth relation between more data and closing in on the right G (though this is not its only virtue). As we would like our theories to be incremental, figuring out how to make this so in, for example, a parameter setting model is a very interesting question (and the one that Dresher and Kaye focus on).
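The iterative/incremental distinction can be made concrete with a toy contrast (my own construction, not Dresher and Kaye’s actual model): both learners below consume the data stream one datum at a time, so both are iterative, but only the first gets closer to the target G as data accumulates.

```python
# Two iterative learners over a two-grammar space {"A", "B"}; target is "A".
data = ["A", "B", "A", "A", "B", "A", "A", "A"]

# Learner 1 (incremental): tracks relative frequencies, so each datum
# nudges it toward the majority grammar -- more data, closer to the right G.
counts = {"A": 0, "B": 0}
trace1 = []
for d in data:
    counts[d] += 1
    trace1.append(max(counts, key=counts.get))

# Learner 2 (iterative but NOT incremental): its current hypothesis is just
# the last datum seen, so it bounces around no matter how much data arrives.
trace2 = list(data)

print(trace1)  # settles on "A" and stays there
print(trace2)  # keeps flipping back to "B" on each stray datum
```

Learner 2 is perfectly iterative yet never converges: its trajectory through the hypothesis space is hostage to the last datum, however long the stream runs.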

[3]
Moreover, finding the right answer in a restricted space need not be identical
to finding the right answer in a large space. Finding your keys on your desk
(notice I said “your” not “my”) is not obviously the same problem as finding a
needle in a haystack.

[4]
CY has an excellent discussion of just how sparse things can be. Standard PoS
arguments note how deficient it is relative to much of the knowledge attained.
