Faculty of Language: Yang on Bayes 2

Thursday, April 7, 2016

Here is part 2. I have included the previous paragraph so that you can get a running start into the discussion. Let me remind you that Charles is culpable for the content of what follows only insofar as reading his stuff stimulated me to think as I did. In other words, very culpable.

It is worth noting that Bayes makes these substantive assumptions for principled reasons. Bayes started out as a normative theory of rationality. Bayes was developed as a formalization of the notion “inference to the best explanation.” In this context the big hypothesis space, full data, full updating, optimizing rule structure noted above are reasonable, for they undergird the following principle of rationality: choose that theory which among all possible theories best matches the full range of possible data. This makes sense as a normative principle of inductive rationality. The only caveat is that Bayes seems to be too demanding for humans. There are considerable computational costs to Bayesian theories (again CY notes these) and it is unclear how good a principle of rationality is if humans cannot apply it. However, whatever its virtues as a normative theory, it is of dubious value in what L.J. Savage (one of the founders of modern Bayesianism) termed “small world” situations (SWS).

What are SWSs? They are situations where the hypothesis space, the space of options over which the probabilities are defined, is small. In SWSs the central problem is to describe the structure of the “small world.” All else is secondary. So in SWSs useful idealizations will help focus on these structures. And that is the problem with Bayes. As Pat Suppes notes (CY quotes from this paper, p. 47; my emphasis, NH):

…any theory of complex problem solving cannot go far simply on the basis of Bayesian decision notions of information processing. The core of the problem is that of developing an adequate psychological theory to describe, analyze and predict the structure imposed by organisms on the bewildering complexities of possible alternatives facing them. The simple concept of an a priori distribution over these alternatives is by no means sufficient and does little toward offering a solution to any complex problem.

…understanding the structures actually used is important for an adequate descriptive theory of behavior….As…inductive logic comes to grips with more realistic problems, the overwhelming combinatorial possibilities that arise in any complex problem will make the need for higher-order structural assumptions self-evident.

In short, the hard problem is to find the structure of the SWSs (to describe the restricted hypothesis space) and Bayes brings little (nothing?) of relevance to the table for solving this problem. Bayes does not prevent considerations of this problem, but it does not promote them or make them central concerns either. And I would argue that by not recognizing (or, worse, radically deemphasizing) the SWS character of most psychological processes, Bayes deflects attention from the hard problem, the one that must be solved if there is to be any progress, onto secondary concerns. That’s the standard cost of a misidealization; it leads you to look in the wrong place.

Let me put this another way, echoing Suppes and CY. The hard problem is “the overwhelming combinatorial possibilities that arise in any complex problem.” The Bayes idealization abstracts away from this at every step: its default assumptions are big hypothesis spaces, large data, panoramic update and optimization. Each of these causes problems. If the cognitive reality (at least in language, but I suspect everywhere) is small worlds, very limited data, myopic (and dumb) updating and satisficing decision then Bayes is just misleading. Moreover, starting where Bayes does means that the bulk of the work will not be done by the Bayes part of any account, but by the algorithms that massage these assumptions out. And indeed, this is what we find when Bayesians respond to their critics. Here’s an example of this move that CY discusses.

CY notes that when confronted with problems Bayesians usually go Marrian. So, for example, when it is observed that Bayes is inconsistent with probability matching, the retort is that this is only so for the level 1 theory. The level 2 algorithms allow for probability matching when certain further constraints are added to the optimizing function. However, if the critique of Bayes is that the idealization is the wrong one, then the fact that one can patch up the problem by saying more is not so much a sign of health but possible confirmation of the initial wrong turn. Of course you can patch anything up with further machinery. This is never in doubt (or almost never). The question is not whether this is possible, but whether the patching up is consistent with the spirit of the initial idealization. If it is, then no problem. If it isn’t, then it argues against the idealization. CY argues that how Bayes handles probability matching goes against the Bayesian spirit of the initial idealization (see the CY discussion of Steyvers et al p 12).
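
To see why probability matching is awkward for an optimizing rule, here is a minimal sketch. It is my own toy illustration, not CY's analysis and not the Steyvers et al model: two options, A and B, where A is correct with probability `p_a`. The function names and the value 0.7 are assumptions made up for the example. An optimizer always picks the likelier option; a matcher picks each option in proportion to its probability and so does worse on average.

```python
import random


def maximize(p_a: float) -> str:
    """Optimizing rule: always pick the more probable option."""
    return "A" if p_a >= 0.5 else "B"


def probability_match(p_a: float, rng: random.Random) -> str:
    """Matching rule: pick A with probability p_a, otherwise B."""
    return "A" if rng.random() < p_a else "B"


def expected_accuracy_maximize(p_a: float) -> float:
    """Chance of being right when always choosing the likelier option."""
    return max(p_a, 1.0 - p_a)


def expected_accuracy_match(p_a: float) -> float:
    """Chance of being right when choices mirror the outcome probabilities."""
    return p_a * p_a + (1.0 - p_a) * (1.0 - p_a)


def matcher_choice_rate(p_a: float, n_trials: int, seed: int = 0) -> float:
    """Fraction of simulated trials on which the matcher picks A."""
    rng = random.Random(seed)
    picks = [probability_match(p_a, rng) for _ in range(n_trials)]
    return picks.count("A") / n_trials


if __name__ == "__main__":
    p_a = 0.7  # hypothetical value, chosen only for illustration
    print("optimizer picks:", maximize(p_a))
    print("optimizer accuracy:", expected_accuracy_maximize(p_a))
    print("matcher accuracy:", expected_accuracy_match(p_a))
    print("matcher rate of A:", matcher_choice_rate(p_a, 20000))
```

With `p_a = 0.7` the optimizer is right 70% of the time, the matcher only 58% (0.7 × 0.7 + 0.3 × 0.3). So observed matching behavior is not what the bare optimizing rule predicts, and something extra has to be added to get it.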

There are many critiques of Bayes that go along the same lines. Many note that Bayes stories have an ad hoc flavor to them (see here and references in linked paper). One critic, Clark Glymour (who, BTW, is no schnook (see here)), has described the work as “Ptolemaic” (see here) and not in a good way (is there a good way?). In the cases discussed, the ad hoc flavor comes from specific assumptions added to the basic Bayes account to accommodate the recalcitrant data, or so the critics argue. It is above my pay grade to evaluate whether the critics are correct in each case. However, this is what we expect if the basic Bayes account is based on a misidealization of the relevant problems.

In fact, this is precisely how we now understand the failure of Ptolemaic astronomy. Think epicycles. Why the epicycles? Because the whole Ptolemaic project starts from two completely wrong idealizations: that the universe is geocentric and that orbits are circular. Starting here, the astronomical data demands epicycles and equants and all the other geometric paraphernalia. Change to a heliocentric system and allow elliptical orbits and all this apparatus disappears. The ad hoc apparatus is inevitable given the initial idealization of the problem. Given the starting assumptions it is possible to “model” all planetary motion (in fact, we now know that any possible set of orbits can be modeled). However, it is equally clear that in this case tracking the data is not a virtue. Wrong idealization, wrong theory no matter how well it can be made to fit the data.

Let me end this (again) overly long post, with a few observations.

First, arguing against an idealization is very difficult. Why? Because it is often possible to “fix” the problems that a misidealization generates by adding further bells and whistles. Thus, it is not generally possible to argue against an idealization empirically. Rather the argument is necessarily more subtle: the accounts delivered are non-explanatory, they involve too much ad hoc machinery, they are too complex, they don’t fit well with other things we want, etc. Despite their difficulty, critiques of idealizations are one of the most important kinds of scientific arguments. They are hardly ever dispositive, but they are extremely important precisely because they expose the basic lay of the land and identify the hard problems that need solutions. They are also the kinds of arguments that can make you a scientific immortal (think Newton on contact mechanics and Einstein on the aether).

Second, CY presents an interesting Marr-like argument against Bayes. It proposes that level 1 theories that have transparent relations to level 2 accounts are better than those that don’t (see the prior three posts on Marr for some discussion). CY argues that Bayes level 1 and 2 theories are conceptually caliginous because of the initial misidealization. Starting from the wrong place will make it hard to transparently relate level 1 computational theories with level 2 theories of algorithms and representations. CY makes such an argument as follows.

It is well known that Bayes procedures taken transparently are computationally intractable. CY observes that to date, even the non-transparent ones that have been proposed are pretty bad (i.e. very slow). This is not a surprise in a Marrian setting if the idealization pursued by Bayes is wrongheaded. Indeed, in a Marr setting one might argue that not being able to generate “nice” algorithms (transparent ones being particularly nice) is a sign that the computational level theory is heading in the wrong direction. The fact that CY’s proposed algorithm is both tractable and easily implementable and that Bayesian proposals are not is a strong argument against the adequacy of the latter idealization in a Marr setting given the Marrian desideratum that computational level theories inform and constrain level 2 algorithmic theories.

CY’s argument goes further. CY’s proposed theory provides a dynamics for acquisition: it not only can tell you how the data determines the G you end up with, but it also tells you how to traverse the set of alternatives available as the data comes in. So, the theory tells you how hypotheses change over real time. Bayes does not do this. It just specifies where you end up given the data available. Bayes is iterative, but not incremental. CY’s theory is both.

Interestingly, CY derives this dynamic incrementality via a transparency assumption regarding the Elsewhere Principle. This is something that a Bayes account cannot possibly achieve precisely because it is well known to be intractable when understood transparently. In other words, CY can do more than Bayes because its understanding of the level 1 problem allows for a very neat/transparent mapping to a level 2 account. This is unavailable for Bayes because the transparency assumption in a Bayes theory leads to intractable computations. So, wrong idealization, less transparency and so no obvious dynamics.

Third, the Bayes idealization embodies a confusion linguists should be familiar with. Remember the child as little linguist hypothesis?[1] It’s the idea that the way to think of how LADs acquire Gs is on a par with how linguists discover the properties of particular Gs. This, we know, is wrong on at least two grounds: first, linguists when they operate consider a wide range of theoretical alternatives for any given "problem", and second, linguists consider full linguistic data to solve their problems while children only consider a small subset of potentially useful data, the PLD. Thus, in two important ways, children inhabit a “smaller world” than linguists do and consider much scantier data in reaching their conclusions than linguists do.

Bayes and the child-as-little-linguist hypothesis make the same mistake. LADs are entirely unlike linguists, and a good thing too. The difference is that the LAD’s hypothesis space of possible Gs given PLD is very small. In other words, they inhabit a small linguistic world. Linguists qua scientists do not. The aim of linguistics is to figure out what this small world looks like. Sadly, linguists’ access to this small world is not aided by the same built-in advantages that the LAD comes equipped with, so it’s standard inference to the best explanation for GGers. That’s why for us the process is slow and laborious while kids acquire their Gs roughly by the age of 5.

So, both Bayes and the child-as-little-linguist hypothesis mis-describe the main acquisition problem by assimilating it to a rational decision problem, thereby abstracting away from the critical features of the situation: a rich, small hypothesis space; sparse, misleading and uninformative data; and, quite likely, no optimizing (see CY p 23 on the principle of sufficiency).

Let me put this one more way: the inferences that LADs make in acquiring their Gs are not rational. They consider too few possible hypotheses and utilize very little data in getting to the G they choose. True, LADs use data and choose among Gs, but the problem is not really like an ideal inductive procedure. Moreover, the ways that it is not like an ideal inductive procedure are what allow the whole procedure to successfully apply. It is precisely by radically narrowing the hypothesis space that it is possible for the LAD to choose the “right” G despite lousy data. The process, it seems, need not be rational to be effective. Just the opposite one might say.

Fourth, CY has a very good discussion of the likely irrelevance of the subset principle to acquisition. This is important (and maybe I will return to it sometime in the future) because the fact that Bayes can derive the subset principle is taken to be a feather in its cap. Roughly speaking, if Bayes then subset principle and because subset principle therefore good for Bayes. But CY argues that there is good reason to think that the subset principle is not operative in acquisition. If so, no argument for Bayes.

Note that this is interesting regardless of whether Bayes is right. The subset principle is often invoked so seeing how problematic it is has its own charms. BTW, one of the big problems with the subset principle concerns its tractability. It is computationally very complex as CY notes. If so, if Bayes derives it, it might not be good news for Bayes.

Ok, enough. Run and get CY. It’s an excellent paper and very important. If nothing else, I hope that it focuses the debate over Bayes. The problem is not the details, but the idealization. The Bayes stance runs against what linguists believe to be the basic lay of the cognitive land. If we are right, then they are likely wrong. This is a place where the different starting points are worth sharpening. CY shows us how to do this, and that’s why it is such an important paper.

[1] See the following: Valian, V., Winzemer, J., & Erreich, A. (1981). A little linguist model of syntax learning. Language acquisition and linguistic theory, 188-209.

16 comments:

Noah Motion, April 9, 2016 at 3:51 PM
I think it's worth keeping in mind that nothing about Bayes' rule requires or depends on a large hypothesis space. Indeed, a common illustration of Bayes' rule (and the influence of low base rates) consists of the estimation of the probability of having a disease given a positive diagnosis. This situation assumes two states of the world (diseased or not) and two possible values in the data (positive or negative test result), and it is about as unambiguous an illustration of the utility of Bayesian reasoning as there is.

Suppes is right that Bayes' rule alone won't get us very far without a lot of additional work on what gives structure to the hypothesis space, but this is consistent with Bayes' rule being (or not being) a good rule for deciding among the available hypotheses.

This is not meant as a criticism of CY's paper, since it is appropriate for him to argue against the commonly employed combination of a large hypothesis space and Bayes' rule. It's just to point out that neither requires the other.
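
The diagnosis illustration above can be worked through in a few lines. This is only a sketch under made-up numbers: the 1% base rate, 95% sensitivity and 5% false positive rate are hypothetical values chosen for illustration, and the function name is likewise an assumption. The hypothesis space has just two members (diseased or not), which is the point of the example.

```python
def posterior_given_positive(prior: float,
                             sensitivity: float,
                             false_positive_rate: float) -> float:
    """Bayes' rule for a two-hypothesis world: P(disease | positive test)."""
    # P(positive) summed over both states of the world.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    p = posterior_given_positive(prior=0.01,
                                 sensitivity=0.95,
                                 false_positive_rate=0.05)
    print(round(p, 3))  # about 0.161
```

Even with a fairly accurate test, the low base rate keeps the posterior at roughly 16%, which is the base-rate effect the illustration is usually meant to bring out.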

Unknown, April 11, 2016 at 4:26 AM
The following two phrases in Norbert's comment seem to me to sum up Bayesian inference well: "data is relevant to hypothesis choice"; and "stats matter in acquisition". Bayes' law is just a simple and reasonable way to implement these two principles.

If you have evidence that the hypothesis space needs to be small, go ahead and perform Bayesian inference on that small hypothesis space. If you don't think that the hypothesis space is small but you're worried that standard inference procedures require more memory or computing resources than a human might have (though I'm not sure there's an easy way to quantify that), you could try to see if humans are using some cheap approximation of Bayesian inference. There's been some interest in using particle filters to model a learner that entertains a limited number of hypotheses at a given moment.

Of course, it might be the case that humans just don't perform Bayesian inference at all or in a particular task (e.g., if they ignore some of the evidence altogether, or ignore its frequency, etc), but that seems like an empirical question that's better settled by simulations rather than a priori arguments.

Alex Clark, April 11, 2016 at 10:10 AM
@Norbert: "And everyone has always believed that stats matter."
If by stats you mean the number of times that things happen, then I don't think this is true. For example, the Aspects model is not sensitive to the stats, nor is, for example, Jeff Heinz's various phonological learners.