
Tuesday, April 19, 2016

Indirect negative evidence

One of the features of Charles’ paper (CY) that I did not comment on before, and that I would like to bring to your attention here, is the relevance (or, more accurately, lack thereof) of indirect negative evidence (INE) for real-time acquisition. CY’s claim is that INE is largely toothless and unlikely to play much of a role in explaining how kids acquire their Gs. A few comments.

CY is not the first place I have been privy to this observation. I recall that my “good and great friend” Elan Dresher said as much when he was working on learnability of stress with Jonathan Kaye. He noted (p.c.) that very few Gs lined up in the sub/superset configuration required for an application of the principle. Thus, though it is logically possible that INE could provide info to the LAD for zeroing in on the right G, in fact it was all but useless given the nature of the parameter space and the Gs that such a space supports. So, nice try INE, but no cigar.[1]
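To see concretely what Dresher’s point amounts to, here is a minimal sketch (with made-up toy “grammars,” not Dresher and Kaye’s actual parameter system): treat each G’s extension as a finite set of forms and count how many pairs of Gs stand in the strict containment relation that the Subset Principle needs.

```python
# Toy illustration (hypothetical grammars, my own): how often do grammar
# pairs stand in the strict subset/superset configuration INE requires?
from itertools import combinations

# Pretend each "grammar" is just the finite set of forms it generates.
grammars = {
    "G1": {"a", "ab", "abb"},
    "G2": {"a", "ab"},      # G2 is a proper subset of G1: the INE configuration
    "G3": {"ba", "bba"},    # disjoint from G1/G2
    "G4": {"a", "ba"},      # overlaps G1 and G3, contains neither
}

def in_subset_configuration(ext1, ext2):
    """True iff one extension strictly contains the other."""
    return ext1 < ext2 or ext2 < ext1

usable = [(g, h) for g, h in combinations(grammars, 2)
          if in_subset_configuration(grammars[g], grammars[h])]
print(usable)   # only ('G1', 'G2') -- most pairs are simply incomparable
```

Most pairs are incomparable, so the principle has nothing to say about them; that is the sense in which it sits idle.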

CY makes this point elaborately. It notes several problems with INE as, for example, embodied in Bayes models (see pp. 14-15).

First, generating the sets necessary to make the INE comparison is computationally expensive. CY cites work by Osherson et al (1986) noting that generating such sets may not even be computable, and work by Fodor and Sakas that crunches the numbers in cases with a finite set of G alternatives and finds that here too computing the extensions of the relevant Gs in order to apply INE is computationally costly.

Nor should this be surprising. If even updating several Gs wrt data quickly gets out of computational control, then it is hardly surprising that using Gs to generate sets of outputs and then comparing them wrt containment is computationally demanding. In sum, surprise, surprise, INE runs into the same kind of tractability issues that Bayes is already rife with.[2]
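For a back-of-the-envelope feel for the numbers (mine, not CY’s): even before any containment check is run, the space of candidate strings over which two extensions must be compared blows up exponentially with string length.

```python
# Rough sketch (illustrative parameters, my own): counting the strings
# over which two grammar extensions would have to be compared.
def num_strings(vocab_size, max_len):
    """Number of strings over a vocabulary, up to a given length."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))

for max_len in (5, 10, 15):
    print(max_len, num_strings(50, max_len))
# With a 50-word vocabulary, strings up to length 10 already number
# ~10^17 -- brute-force generation and set comparison is hopeless.
```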

Second, and maybe more interesting still, CY diagnoses why it is that INE is not useful in real world contexts. Here is CY (note: ‘super-hypothesis’ is what some call the supersets):

The fundamental problem can be stated simply: the super-hypothesis cannot be effectively ruled out due to the statistical properties of child directed English. (16)

What exactly is the source of the problem? Zipf’s law.

The failure of indirect negative evidence can be attributed to the inherent statistical distribution of language. Under Zipf’s law, which applies to linguistic units (e.g. words) as well as their combinations (e.g. N-grams, phrases, rules; see Yang (2013)), it is very difficult to differentiate low probability events and impossible events.

And this makes it inadvisable to use the absence of a particular form as evidence of its non-generability. In other words, Zipf’s law cuts the ground from under INE.
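The point is easy to simulate. Here is a minimal sketch (illustrative vocabulary and sample sizes of my own, not CY’s corpus figures): draw a Zipfian sample and count how many perfectly possible types never show up at all.

```python
# Toy simulation: under a Zipfian distribution, many possible forms are
# absent from a realistic sample, so absence cannot separate "rare"
# from "ungrammatical".
import random

V = 10_000                                    # number of possible forms (types)
weights = [1 / r for r in range(1, V + 1)]    # Zipf: p(rank r) proportional to 1/r
sample = random.choices(range(V), weights=weights, k=100_000)

unseen = V - len(set(sample))
print(f"{unseen} of {V} possible forms never occurred")
# Typically well over a thousand types are absent even from 100k tokens --
# exactly the gap an INE learner would wrongly treat as impossibility.
```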

Here CY (as it notes) is making a point quite similar to that made over 25 years ago by Steve Pinker (here) (14):

…it turns out to be far from clear what indirect negative evidence could be. It can’t be true that the child literally rules out any sentence he or she hasn’t heard, because there is always an infinity of sentences that he or she hasn’t heard that are grammatical …And it is trivially true that the child picks hypothesis grammars that rule out some of the sentences that he or she hasn’t heard, and that if a child hears a sentence she or she will often entertain a different hypothesis grammar that if she or she hasn’t heard it. So the question is, under exactly what circumstances does a child conclude that a non witnessed sentence is ungrammatical?

What CY notes is that this is not only a conceptual possibility given the infinite number of grammatical linguistic objects; because of the Zipfian distribution of linguistic forms in the PLD, it is statistically likely that the evidence relevant to inferring grammatical absence from statistical absence (or rarity) will be very spotty, and that building on such absence will lead in very unfortunate directions. CY discusses a nice case of this wrt adjectives, but the point is quite general. It seems that Zipf’s law makes relying on gaps in the data to draw conclusions about (il)licit grammatical structures a bad strategy.

This is a very nice point, which is why I have belabored it. So, not only are the computations intractable, but the evidence relevant for using INE is inadequate for principled reasons. Conclusion: forget about INE.

Why mention this? It is yet another problem with Bayes. Or, more directly, it suggests that the premier theoretical virtue of Bayes (the one that gets cited whenever I talk to a Bayesian) is empirically nugatory. Bayes incorporates the subset principle (i.e. Bayesian reasoning can explain why the subset principle makes sense). This might seem like a nice feature. And it would be, were INE actually an important feature of the LAD’s learning strategy (i.e. a principle that guided learning). But, it seems that it is not. It cannot be used, for both computational and statistical reasons. Thus, it is a strike against any theory of the ideal learner that it incorporates the subset principle in a principled manner. Why? Because the idealization points in the wrong direction. It suggests that negative evidence is important to the LAD in getting to its G. But if this is false, then a theory that incorporates it in a principled fashion is, at best, misleading. And being misleading is a major strike against an idealization. So, bad idealization! Again!
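For the record, here is the feature being credited to Bayes, in sketch form (toy numbers, my own): with extensions in a subset/superset configuration, the likelihood term (the so-called size principle) automatically shifts posterior mass toward the subset G as consistent data accumulates. This is the sense in which Bayes “incorporates” the subset principle, and why absence of superset-only forms functions as INE.

```python
# Toy size-principle sketch: a superset hypothesis spreads its likelihood
# over more forms, so data consistent with both Gs favors the subset.
subset_size, superset_size = 10, 100          # extensions: subset inside superset
prior = {"subset": 0.5, "superset": 0.5}

def posterior_subset(n):
    """Posterior on the subset G after n forms, all consistent with both Gs."""
    like_sub = (1 / subset_size) ** n
    like_sup = (1 / superset_size) ** n
    z = prior["subset"] * like_sub + prior["superset"] * like_sup
    return prior["subset"] * like_sub / z

for n in (1, 5, 10):
    print(n, round(posterior_subset(n), 6))
# n=1 -> 0.909091; n=5 -> 0.99999; n=10 -> ~1.0: unseen superset-only
# forms act as indirect negative evidence against the superset G.
```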

And it’s worse still, because there is an alternative. Here’s CY (18):

The alternative strategy is a positive one, as it exploits the distributional similarities … Under this approach, the over-hypothesis is never available to the learner, and there is no need to rule it out.

So, frame the problem well (i.e. adopt the right idealization) and you point yourself in the right direction (i.e. you avoid having to deal with the problems that the wrong idealization generates).

As CY notes, none of these arguments are “decisive.” Arguments against idealizations never are (though the ones CY presents and that I have rehearsed wrt Bayes in the last several posts seem to me pretty close to dispositive). But they are important. Like all matters scientific, idealizations need to be defended. One way to defend them is to note that they point to the right kinds of problems and suggest the kinds of solutions we ought to explore. If an idealization consistently points in the wrong direction, then it’s time to chuck it. It’s worse than false; it is counter-productive. In the domain of language, whatever the uses of the technology Bayes makes available, it looks like it is misleading in every possible way. The best that we seem to be able to say for it is that if we don’t take any of its claims seriously then it won’t cause too much trouble. Wow, what an endorsement. Time to let the thing go and declare the “revolution” over. Let’s say this loudly and all together: Bye bye Bayes!





[1] It is worth noting that the Dresher-Kaye system was pretty small, about 10 parameters. Even in this small system, the subset principle proved to be idle.
[2] In fact, it might be worse in this case. The Bayes maneuver generally circumvents the tractability issue by looking for algorithms that can serve to “update” the hypotheses without actually directly updating them. For INE we will need cheap algorithms to generate the required sets and then compare them. Do such quick and dirty algorithms exist for generation and comparison of the extensions of hypotheses?

Monday, April 11, 2016

Pouring gasoline on the flames: Yang on Bayes 3

I want to pour some oil on the flames. Which flames? The ones that I had hoped that my two recent posts on Yang’s critique of Bayes (here and here) would engender. There has been some mild pushback (from Ewan, Tal, Alex and Avery). But the comments section has been pretty quiet. I want to restate what I take to be the heart of the critique because, if correct, it is very important. If correct, it suggests that there is nothing worth salvaging from the Bayes “revolution” for there is no there there. Let me repeat this. If Yang is right, then Bayes is a dead end with no redeeming scientific (as opposed to orthographic) value. This does not mean that specific Bayes proposals are worthless. They may not be. What it means is that Bayes per se not only adds nothing to the discussion, but that taking its tenets to heart will mislead inquiry. How so? It endorses the wrong idealization of how stats are relevant to cognition. And misidealizations are as big a mistake as one can make, scientifically speaking. Here’s the bare bones of the argument.

1.     Everyone agrees that data matters for hypothesis choice
2.     Everyone agrees that stats matter in making this choice
3.     Bayes makes 4 specific claims about how stats matter for hypothesis choice (see the sketch right after this list):
a.     The hypothesis space is cast very wide. In the limit, all possible hypotheses are in the space of options
b.     All potentially relevant data is considered, i.e. any data that could decide between competing hypotheses is used to adjudicate among the hypotheses in the space
c.     All hypotheses are evaluated wrt all of the data. So, as data is considered, every hypothesis’s chance of being true is evaluated wrt every data point considered
d.     When all data has been considered, the rule is to choose that hypothesis in the space with the highest score
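Here is a minimal sketch of the learner that (3a-d) jointly describe (toy grammars and data of my own invention, not anyone’s actual model): the whole hypothesis space is held from the start (3a), all the data is consumed (3b), every hypothesis is rescored against every datum (3c), and the top scorer is chosen at the end (3d).

```python
# Toy learner implementing (3a-d). Note the nested loop: the cost is
# |hypotheses| x |data| likelihood evaluations, plus renormalization --
# exactly where the tractability worries bite.
def bayes_learner(priors, likelihood, data):
    scores = dict(priors)                    # (3a): the whole space, up front
    for datum in data:                       # (3b): all the data is used
        for h in scores:                     # (3c): every G scored on every datum
            scores[h] *= likelihood(h, datum)
        z = sum(scores.values())
        scores = {h: s / z for h, s in scores.items()}
    return max(scores, key=scores.get)       # (3d): choose the highest score

# Toy usage: two "grammars," each just a set of licensed forms.
grammars = {"G_small": {"a", "ab"}, "G_big": {"a", "ab", "ba", "bb"}}

def likelihood(h, datum):
    """Uniform over the extension; tiny floor instead of zero for robustness."""
    ext = grammars[h]
    return 1 / len(ext) if datum in ext else 1e-9

print(bayes_learner({"G_small": 0.5, "G_big": 0.5},
                    likelihood, ["a", "ab", "a"]))   # -> G_small
```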

Two things are worth noting about the above.

First, that (3) provides serious content to a Bayesian theory, unlike (1) and (2). The latter are trivial in that nobody has ever thought otherwise. Nobody. Ever. So if this is the point of Bayes, then this ain’t no revolution!

Second, (3) has serious normative motivation. It is a good analysis of what kind of inference an inference to the best explanation might be. Normatively, an explanation is best if it is better than all other possible explanations and accounts for all of the possibly relevant data. Ideally, this implies evaluating all alternatives wrt all the data and choosing the best. This gives us (3a-d). Cognitive Bayes (CB) is the hypothesis that normative Bayes (NB) is a reasonable idealization of what people actually do when they learn/acquire something. And we should appreciate that this could be the case. Let’s consider how for a moment.

The idealization would make sense for the following kind of case (let’s restrict ourselves to language). Say that the hypothesis space of potential Gs was quite big. For concreteness, say that we were always considering about 50 different candidate Gs. This is not all possible Gs, but 50 is a pretty big number computationally speaking. So say 50 or more alternatives is the norm. Then Bayes (3a) would function a lot like the standard linguistic assumption that the set of well-formed syntactic objects in a given language is effectively infinite. Let me unpack the analogy.

This infinity assumption need not be accurate to be a good idealization. Say it turns out that the number of well-formed sentences a native speaker of English is competent wrt is “only” 10^1000. Wouldn’t this invalidate the infinity assumption? No, it would show that it is false, but not that it is a bad idealization. Why? Because the idealization focuses attention on the right problem. Which one? The Projection Problem: how do native speakers go from a part of the language to all of it? How, given exposure to only a subset of the language, does a LAD get mastery over a whole language? The answer: you acquire recursive rules, a G, that’s how. And this is true whether or not the “language” is infinite or just very big. The problem, going from a subset to its containing superset, will transit via a specification of rules whether or not the set is actually infinite. All the infinity idealization does is concentrate the mind on the projection problem by making the alternative tempting idea (learning by listing) silly. This is what Chomsky means when he says in Current Issues: “once we have mastered a language, the class of sentences with which we can operate fluently and without difficulty or hesitation is so vast that for all practical purposes (and, obviously, for all theoretical purposes), we may regard it as infinite” (7, my emphasis NH). See: the idealization is reasonable because it does not materially change the problem to be solved (i.e. how to go from the part of the language you are exposed to, to the whole language that you have mastery over).

A similar claim could be true of Bayes. Yes, the domain of Gs a LAD considers is in fact big. Maybe not thousands or millions of alternatives, but big enough to be worth idealizing to a big hypothesis space, in the same way that it is worth assuming that the class of sentences a native speaker is competent wrt is infinite. Is this so? Probably not. Why not? Because even moderately large hypothesis spaces (say with over 5 competing alternatives) turn out to be very hard to manage. So the standard practice is to use really truncated spaces, really small SWSs. But when you so radically truncate the space, there is no reason to think that the inductive problem remains the same. Just think: if the number of sentences we actually knew was about 5 (roughly what happens in animal communication systems), would the necessity of rules really be obvious? Might we not reject the idealization Chomsky argues for (and note that I emphasize ‘argue’)? So, rejecting (3a) means rejecting part of the Bayes idealization.

What of the other parts, (3b-d)? Well, as I noted in my posts, Charles argues that each and every one is wrong in such a way as to be not worth making. It gets the shape of the problem wrong. He may be right. He may be wrong (not really, IMO), but he makes an argument. And if he is right, then what’s at stake is the utility of NB as a useful idealization for cognitive purposes. And, if you accept this, we are left with (1-2), which is methodological pablum.

I noted one other thing: the normative idealization above was once considered as a cognitive option within linguistics. It was known as the child-as-little-linguist theory. And it had exactly the same problems that Bayes has. It suggests that what kids do is what linguists do. But it is not the same thing at all. And realizing this helped focus attention on what problem the LAD actually faces. Bayes is not unique in misidealizing a problem.

Three more points and I end today’s diatribe.

First, one can pick and choose among the four features above. In other words, there is no law saying that one must choose the various assumptions as a package. One can adopt a SWS assumption (rejecting 3a) while adopting a panoramic view of the updating function (assuming that every hypothesis in the space is updated wrt every new data point) and rejecting choice optimization (3d). In other words, mixing and matching is fine and worth exploring. But what gives Bayes content, and makes it more than one of many bookkeeping notations, is the idealization implicit in CB as NB.

Second, what makes Bayes scientifically interesting is the idealization implicit in it. I mention this because, as Tal notes in a comment (here), it seems that current Bayesians are promoting their views as just a “set of modeling practices.” The ‘just’ is mine, but this seems to me what Tal is indicating about the paper he links to. But the “just” matters. Modeling practices are scientifically interesting to the degree that they embody ideas about the problem being modeled. The good ones are those that embody a good idealization. So, either these practices are based on substantive assumptions or they are “mere” practices. If the latter, then Bayes modeling is in itself of zero scientific interest. Does anyone really want to defend Bayes in this way? I confess that if this is the intent then there is nothing much to argue about, given how modest (how really modest, how really really modest) the Bayes claim is.


Last, there is a tendency to insulate one’s work from criticism. One way of doing this is to refuse to defend the idealizations implicit in one’s technology. But technology is never innocent. It always embodies assumptions about the way the world is; a good technology allows one to see/do things that other technologies do not permit or, at least, does not distort how the basic problems of interest are to be investigated. But researchers hate having to defend their technology, more often favoring the view that how it runs is its own defense. I have been arguing that this is incorrect. It does matter. So, if it turns out that Bayesians now are urging us to use the technology but are backing away from the idealizations implicit in it, that is good to know. This was not how it was initially sold. It was sold as a good way of developing level 1 cognitive theories. But if Bayes has no content, then this is false. It cannot be the source of level 1 theories, for on the revised view of Bayes as a “set of modeling practices,” Bayes per se has no content, so it is not and cannot be a level 1 theory of anything. It is vacuous. Good to know. I would be happy if this were now widely conceded by our most eminent Bayesians. If this is now the current view of things, then there is nothing to argue about. If only Bayes had told us this sooner.