Thursday, April 7, 2016

Yang on Bayes 2

Here is part 2. I have included the previous paragraph so that you can get a running start into the discussion. Let me remind you that Charles is culpable for the content of what follows only insofar as reading his stuff stimulated me to think as I did. In other words, very culpable.

It is worth noting that Bayes makes these substantive assumptions for principled reasons. Bayes started out as a normative theory of rationality. Bayes was developed as a formalization of the notion “inference to the best explanation.” In this context the big hypothesis space, full data, full updating, optimizing rule structure noted above are reasonable, for they undergird the following principle of rationality: choose that theory which among all possible theories best matches the full range of possible data. This makes sense as a normative principle of inductive rationality. The only caveat is that Bayes seems to be too demanding for humans. There are considerable computational costs to Bayesian theories (again CY notes these) and it is unclear how good a principle of rationality is if humans cannot apply it. However, whatever its virtues as a normative theory, it is of dubious value in what L.J. Savage (one of the founders of modern Bayesianism) termed “small world” situations (SWS).

What are SWSs? They are situations where the hypothesis space, the space of options over which the probabilities are defined, is small. In SWSs the central problem is to describe the structure of the “small world.” All else is secondary. So in SWSs useful idealizations will help focus on these structures. And that is the problem with Bayes. As Pat Suppes notes (CY quotes from this paper, p. 47; my emphasis, NH):

…any theory of complex problem solving cannot go far simply on the basis of Bayesian decision notions of information processing. The core of the problem is that of developing an adequate psychological theory to describe, analyze and predict the structure imposed by organisms on the bewildering complexities of possible alternatives facing them. The simple concept of an a priori distribution over these alternatives is by no means sufficient and does little toward offering a solution to any complex problem.

…understanding the structures actually used is important for an adequate descriptive theory of behavior….As…inductive logic comes to grips with more realistic problems, the overwhelming combinatorial possibilities that arise in any complex problem will make the need for higher-order structural assumptions self-evident.

In short, the hard problem is to find the structure of the SWSs (to describe the restricted hypothesis space) and Bayes brings little (nothing?) of relevance to the table for solving this problem. Bayes does not prevent considerations of this problem, but it does not promote them or make them central concerns either. And I would argue that by not recognizing (or, worse, radically deemphasizing) the SWS character of most psychological processes, Bayes deflects attention from the hard problem, the one that must be solved if there is to be any progress, onto secondary concerns. That’s the standard cost of a misidealization; it leads you to look in the wrong place.

Let me put this another way, echoing Suppes and CY. The hard problem is “the overwhelming combinatorial possibilities that arise in any complex problem.” The Bayes idealization abstracts away from this at every step: its default assumptions are big hypothesis spaces, large data, panoramic update and optimization. Each of these causes problems. If the cognitive reality (at least in language, but I suspect everywhere) is small worlds, very limited data, myopic (and dumb) updating and satisficing decision then Bayes is just misleading. Moreover, starting where Bayes does means that the bulk of the work will not be done by the Bayes part of any account, but by the algorithms that massage these assumptions out. And indeed, this is what we find when Bayesians respond to their critics. Here’s an example of this move that CY discusses.

CY notes that when confronted with problems Bayesians usually go Marrian. So, for example, when it is observed that Bayes is inconsistent with probability matching, the retort is that this is only so for the level 1 theory. The level 2 algorithms allow for probability matching when certain further constraints are added to the optimizing function. However, if the critique of Bayes is that the idealization is the wrong one, then the fact that one can patch up the problem by saying more is not so much a sign of health but possible confirmation of the initial wrong turn. Of course you can patch anything up with further machinery. This is never in doubt (or almost never). The question is not whether this is possible, but whether the patching up is consistent with the spirit of the initial idealization. If it is, then no problem. If it isn’t, then it argues against the idealization. CY argues that how Bayes handles probability matching goes against the Bayesian spirit of the initial idealization (see the CY discussion of Steyvers et al p 12).
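To make the probability-matching point concrete, here is a minimal sketch in Python. It assumes a toy two-outcome prediction task (an illustration of mine, not an example taken from CY or Steyvers et al.) in which one outcome occurs with probability p. An optimizing rule always predicts the likelier outcome, while a probability matcher predicts each outcome in proportion to its frequency. The function names are hypothetical.

```python
def maximizing_accuracy(p: float) -> float:
    """Expected accuracy when always predicting the more likely outcome."""
    return max(p, 1.0 - p)


def matching_accuracy(p: float) -> float:
    """Expected accuracy when predicting outcome A with probability p.

    A guess is correct when prediction and outcome are both A (p * p)
    or both B ((1 - p) * (1 - p)).
    """
    return p * p + (1.0 - p) * (1.0 - p)


if __name__ == "__main__":
    # Compare the two rules at a few illustrative values of p.
    for p in (0.5, 0.7, 0.9):
        print(p, maximizing_accuracy(p), round(matching_accuracy(p), 4))
```

With p = 0.7 the optimizer is right 70% of the time and the matcher only 58%. Under these assumptions matching is never better than optimizing, which is why matching behavior sits awkwardly with an account built around an optimizing choice rule.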

There are many critiques of Bayes that go along the same lines. Many note that Bayes stories have an ad hoc flavor to them (see here and references in linked paper). One critic, Clark Glymour (who, BTW, is no schnook (see here)), has described the work as “Ptolemaic” (see here) and not in a good way (is there a good way?). In the cases discussed, the ad hoc flavor comes from specific assumptions added to the basic Bayes account to accommodate the recalcitrant data, or so the critics argue. It is above my pay grade to evaluate whether the critics are correct in each case. However, this is what we expect if the basic Bayes account is based on a misidealization of the relevant problems.

In fact, this is precisely how we now understand the failure of Ptolemaic astronomy. Think epicycles. Why the epicycles? Because the whole Ptolemaic project starts from two completely wrong idealizations: that the universe is geocentric and that orbits are circular. Starting here, the astronomical data demand epicycles and equants and all the other geometric paraphernalia. Change to a heliocentric system and allow elliptical orbits and all this apparatus disappears. The ad hoc apparatus is inevitable given the initial idealization of the problem. Given the starting assumptions it is possible to “model” all planetary motion (in fact, we now know that any possible set of orbits can be modeled). However, it is equally clear that in this case tracking the data is not a virtue. Wrong idealization, wrong theory no matter how well it can be made to fit the data.

Let me end this (again) overly long post, with a few observations.

First, arguing against an idealization is very difficult. Why? Because it is often possible to “fix” the problems that a misidealization generates by adding further bells and whistles. Thus, it is not generally possible to argue against an idealization empirically. Rather the argument is necessarily more subtle: the accounts delivered are non-explanatory, they involve too much ad hoc machinery, they are too complex, they don’t fit well with other things we want etc. Despite their difficulty, critiques of idealizations are one of the most important kinds of scientific arguments. They are hardly ever dispositive, but they are extremely important precisely because they expose the basic lay of the land and identify the hard problems that need solutions. They are also the kinds of arguments that can make you a scientific immortal (think Newton on contact mechanics and Einstein on the aether).

Second, CY presents an interesting Marr-like argument against Bayes. It proposes that level 1 theories that have transparent relations to level 2 accounts are better than those that don’t (see the prior three posts on Marr for some discussion). CY argues that Bayes level 1 and 2 theories are conceptually caliginous because of the initial misidealization. Starting from the wrong place will make it hard to transparently relate level 1 computational theories with level 2 theories of algorithms and representations. CY makes such an argument as follows.

It is well known that Bayes procedures taken transparently are computationally intractable. CY observes that to date, even the non-transparent ones that have been proposed are pretty bad (i.e. very slow). This is not a surprise in a Marrian setting if the idealization pursued by Bayes is wrongheaded. Indeed, in a Marr setting one might argue that not being able to generate “nice” algorithms (transparent ones being particularly nice) is a sign that the computational level theory is heading in the wrong direction. The fact that CY’s proposed algorithm is both tractable and easily implementable and that Bayesian proposals are not is a strong argument against the adequacy of the latter idealization in a Marr setting given the Marrian desideratum that computational level theories inform and constrain level 2 algorithmic theories.

CY’s argument goes further. CY’s proposed theory provides a dynamics for acquisition: it not only can tell you how the data determines the G you end up with, but it also tells you how to traverse the set of alternatives available as the data comes in. So, the theory tells you how hypotheses change over real time. Bayes does not do this. It just specifies where you end up given the data available. Bayes is iterative, but not incremental. CY’s theory is both.

Interestingly, CY derives this dynamic incrementality via a transparency assumption regarding the Elsewhere Principle. This is something that a Bayes account cannot possibly achieve precisely because it is well known to be intractable when understood transparently. In other words, CY can do more than Bayes because its understanding of the level 1 problem allows for a very neat/transparent mapping to a level 2 account. This is unavailable for Bayes because the transparency assumption in a Bayes theory leads to intractable computations. So, wrong idealization, less transparency and so no obvious dynamics.

Third, the Bayes idealization embodies a confusion linguists should be familiar with. Remember the child as little linguist hypothesis?[1] It’s the idea that how LADs acquire Gs is on a par with how linguists discover the properties of particular Gs. This, we know, is wrong on at least two grounds: first, linguists consider a wide range of theoretical alternatives for any given "problem," and second, linguists consider the full linguistic data to solve their problems while children only consider a small subset of potentially useful data, the PLD. Thus, in two important ways, children inhabit a “smaller world” than linguists do and consider much scantier data in reaching their conclusions than linguists do.

Bayes and the child-as-little-linguist hypothesis make the same mistake. LADs are entirely unlike linguists, and a good thing too. The difference is that the LAD's hypothesis space of possible Gs given PLD is very small. In other words, LADs inhabit a small linguistic world. Linguists qua scientists do not. The aim of linguistics is to figure out what this small world looks like. Sadly, linguists’ access to this small world is not aided by the same built-in advantages that the LAD comes equipped with, so it’s standard inference to the best explanation for GGers. That’s why for us the process is slow and laborious while kids acquire their Gs roughly by the age of 5.

So, both Bayes and the child-as-little-linguist hypothesis mis-describe the main acquisition problem by assimilating it to a rational decision problem, thereby abstracting away from the critical features of the situation: rich small hypothesis space, sparse, misleading and uninformative data, and, quite likely, no optimizing (see CY p 23 on the principle of sufficiency).

Let me put this one more way: the inferences that LADs make in acquiring their Gs are not rational. They consider too few possible hypotheses and utilize very little data in getting to the G they choose. True, LADs use data and choose among Gs, but the problem is not really like an ideal inductive procedure. Moreover, the ways that it is not like an ideal inductive procedure are what allow the whole procedure to successfully apply. It is precisely by radically narrowing the hypothesis space that it is possible for the LAD to choose the “right” G despite lousy data. The process, it seems, need not be rational to be effective. Just the opposite one might say.

Fourth, CY has a very good discussion of the likely irrelevance of the subset principle to acquisition. This is important (and maybe I will return to it sometime in the future) because the fact that Bayes can derive the subset principle is taken to be a feather in its cap. Roughly speaking, if Bayes then subset principle and because subset principle therefore good for Bayes. But CY argues that there is good reason to think that the subset principle is not operative in acquisition. If so, no argument for Bayes.

Note that this is interesting regardless of whether Bayes is right. The subset principle is often invoked so seeing how problematic it is has its own charms. BTW, one of the big problems with the subset principle concerns its tractability. It is computationally very complex as CY notes. If so, if Bayes derives it, it might not be good news for Bayes.

Ok, enough. Run and get CY. It’s an excellent paper and very important. If nothing else, I hope that it focuses the debate over Bayes. The problem is not the details, but the idealization. The Bayes stance runs against what linguists believe to be the basic lay of the cognitive land. If we are right, then they are likely wrong. This is a place where the different starting points are worth sharpening. CY shows us how to do this, and that’s why it is such an important paper.



[1] See the following: Valian, V., Winzemer, J., & Erreich, A. (1981). A little linguist model of syntax learning. Language acquisition and linguistic theory, 188-209.

16 comments:

  1. I think it's worth keeping in mind that nothing about Bayes' rule requires or depends on a large hypothesis space. Indeed, a common illustration of Bayes' rule (and the influence of low base rates) consists of the estimation of the probability of having a disease given a positive diagnosis. This situation assumes two states of the world (diseased or not) and two possible values in the data (positive or negative test result), and it is about as unambiguous an illustration of the utility of Bayesian reasoning as there is.

    Suppes is right that Bayes' rule alone won't get us very far without a lot of additional work on what gives structure to the hypothesis space, but this is consistent with Bayes' rule being (or not being) a good rule for deciding among the available hypotheses.

    This is not meant as a criticism of CY's paper, since it is appropriate for him to argue against the commonly employed combination of a large hypothesis space and Bayes' rule. It's just to point out that neither requires the other.
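The two-hypothesis disease illustration mentioned in this comment can be worked through in a few lines of Python. This is only a sketch: the base rate, sensitivity and false positive rate below are made-up numbers chosen for illustration, and the function name is hypothetical.

```python
def posterior_disease(prior: float, sensitivity: float,
                      false_positive_rate: float) -> float:
    """P(disease | positive test) via Bayes' rule over two hypotheses."""
    # Joint probability of a positive test under each state of the world.
    p_pos_and_disease = sensitivity * prior
    p_pos_and_healthy = false_positive_rate * (1.0 - prior)
    # Normalize over the two hypotheses (diseased, not diseased).
    return p_pos_and_disease / (p_pos_and_disease + p_pos_and_healthy)


if __name__ == "__main__":
    # Hypothetical numbers: 1% base rate, 90% sensitivity, 5% false positives.
    print(round(posterior_disease(0.01, 0.90, 0.05), 3))
```

With these assumed numbers the posterior is roughly 0.15: even after a positive result, the low base rate keeps the probability of disease small. The hypothesis space here has exactly two members, which is the commenter's point that the update rule itself does not require a large space.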

    Replies
    1. You are quite right. Never throw babies out with bathwater and make distinctions where useful. So thx for the clarification. As I noted, Bayes has 4 parts, only one of which concerns SWSs. And you are right, maybe the Bayes update rule is the right one in SWSs.

      However, we have some reason for skepticism if Charles is right. First, he argues against an optimizing rule and he argues against the idea that we update ALL hypotheses even in a small space. Recall the Trueswell, Gleitman et al papers on word learning. This would suggest that the right rule does not update ALL alternatives in the SMALL hypothesis space and choose the BEST. But if not, then it is not merely the SWS that is at issue. If I understand Charles (always a big if given my inadequacies in these areas) then no part of the Bayes idealization is worth retaining.

      So, yes, we should be free to pick and choose which part of the idealization to maintain. But right now, from what I can see, it's the whole iguana that's up for grabs not just the SWS assumption.

    2. No, I think you're missing the point. The point is that this sequence of statements rests on a false premise.

      "the ways that it [real LA] is not like an ideal inductive procedure are what allow the whole procedure to successfully apply. It is precisely by radically narrowing the hypothesis space. . ."

      The false premise is that ideal inductive procedures do not have radically narrowed hypothesis spaces. Those two things are orthogonal. If by "the whole iguana" you mean Bayesian inference, and part of the whole iguana is a rich hypothesis space, then that's a misunderstanding of what Bayesian inference is.

    3. I may be missing the point, but here is what I mean. Bayes is built to reflect an ideal of rationality. The ideal suggests that you consider all possible hypotheses and all possible data. I have no idea what you intend by "ideal inductive procedure", but the ideal that underlies Bayes historically is a normative ideal of rationality, hence the wide net wrt hypothesis space and data and update rule.

      Now that is an ideal I can understand, as did Savage and Suppes, for example. Hence their remarks wrt SWSs. I can also see how this ideal of rationality might be empirically relevant as well. Imagine that learning really did involve relatively large hypothesis spaces and updating of all of these wrt all incoming data, leading to the choice of the best one (this after all could be true in roughly the way the idealization that the set of sentences is infinite is true). Were this correct, then normative Bayes would be a reasonable empirical hypothesis. But, I think, and you seem to be agreeing, that this is NOT a reasonable empirical assumption regardless of its normative value.

      Ok, so let's concede this and look for another sense of ideal. Ok, what is it? I dunno, but let's say that there is one and in this one hypothesis spaces are small (SWSs in Savage's sense). Then we can still ask what of the other Bayes idealizations, say wrt the update function: is EVERY hypothesis updated wrt new data and then evaluated with respect to the choice rule (choose the one with the BEST coverage), or are only SOME (ONE?) evaluated? Is the choice to choose the BEST or is it something else, as Yang suggests, a satisficing function? All of this is also part of the Bayes rationality assumption. These too are motivated on normative grounds (i.e. as part of an explication of what inference to the best explanation would be).

      Of course, one might concentrate on ONLY one part of the system or another. That's fine. Bayes rule but in SWSs is another idealization. Or Bayes rule but only updating 1/2 the hypotheses in SWSs. Or ...But what Charles' paper is arguing is that there is NO part of the idealization that is useful. If he is right, and I find it convincing for the language case, then Bayes does nada.

      So, here's my question to you: how does Bayes think about the induction problem? Or more accurately, given that all agree that data is useful in finding the right hypothesis, what is Bayes' distinctive understanding of this process? What makes Bayesian inference special? It must be more than the trivial observation that data is relevant to hypothesis choice. Or that stats matter in acquisition. So what does Bayes bring to the table? Oh, btw, don't talk about subset principles either, for Charles does a good job of showing that this does little if any real work.

    4. I would agree that the talk about the optimality of Bayes is misdirected, because, although it is reasonable to think that nature has the correct statistics for classic survival problems such as telling whether that shape partially obscured by the bush is a harmless (and perhaps tasty) deer or a dangerous leopard, no such thing is true for language, which at any given time is whatever the current generation learned it as being, regardless of what their parents were doing (creolization is the paradigm case where the learner is evidently *not* sound for the pidgin data it gets presented with, but produces something else, presumably dictated by the Prior a.k.a. UG).

      For me the charm is that it is conceptually an extremely simple completion of the Evaluation Metric that solves the problem of the simplest grammar being S->W*. Finding the correct grammar is clearly a very hard problem (easier as UG gets tuned properly), but the original EM was not supposed to be a practical discovery procedure either, but, I think, a sort of conceptual compass to indicate the direction in which you want your theory to develop.

      The old answer was 'to be as short as possible so as to account for the data', which became unviable once it became clear that negative data wasn't used very much, or maybe even at all; the new one would be 'to get the best combined grammar and data score'. This latter has some practical aspects which are I think useable even in elementary syntax teaching, such as explaining why, if the 14 sentences in unknown language X in the assignment you've handed out all have SOV word order, your grammar should mandate that. The way I actually did put it in teaching was that you want your grammar to provide as few ways of saying anything as possible, given the data, but also not to be too complicated (a kind of balancing act), and the kids did seem to get the idea.

      This is not supposed to imply that it is wrong to try to dive deeper into the computational details and find methods that actually acquire grammars or parts of grammars, and it also doesn't mean that Norbert isn't correct in thinking that at least some of the Bayesian advocates have their hearts at least somewhat in the wrong place (I suspect that some of them do, just like Norbert says).

      But these considerations don't alter my perception that Bayes provides a simple justification for continuing to try to tune UG so that it provides relatively simple grammars for the sorts of NL grammars that we seem to need and either none at all or excessively complex ones for the ones we don't, a project that seems to me to have been not quite properly set on its tracks for a fairly long time.

  2. The following two phrases in Norbert's comment seem to me to sum up Bayesian inference well: "data is relevant to hypothesis choice"; and "stats matter in acquisition". Bayes' law is just a simple and reasonable way to implement these two principles.

    If you have evidence that the hypothesis space needs to be small, go ahead and perform Bayesian inference on that small hypothesis space. If you don't think that the hypothesis space is small but you're worried that standard inference procedures require more memory or computing resources than a human might have (though I'm not sure there's an easy way to quantify that), you could try to see if humans are using some cheap approximation of Bayesian inference. There's been some interest in using particle filters to model a learner that entertains a limited number of hypotheses at a given moment.

    Of course, it might be the case that humans just don't perform Bayesian inference at all or in a particular task (e.g., if they ignore some of the evidence altogether, or ignore its frequency, etc), but that seems like an empirical question that's better settled by simulations rather than a priori arguments.

    Replies
    1. Tal, if this is what you take Bayes to be then there is very little there there. In fact, I would say nothing at all given standard background assumptions. Everyone has always believed that data/input matters to virtually ANY cognitive acquisition/deployment. And everyone has always believed that stats matter. The only question is which data, which hypotheses, how to use data (update rule) and how to choose. These four questions are what distinguish theories. If Bayes is silent on THESE questions then there is no theory at all. Nothing. Nada. Zilch.

      I have taken another view of matters. That Bayes intends to say something substantive. That it bases its substantive suggestion on an idealization of the problems of learning in most cognitive domains. The idealization leverages a normative view of rationality which, in turn, suggests four principles: wide hypothesis spaces, usage of all potentially relevant data, update of all alternatives and a choice rule that optimizes. This is a theory. The question is not whether it is true, but whether the idealization it urges is true enough. I take Charles' paper to argue that it is wrong on every point. SWSs, not much data, very few hypotheses among potential ones updated and a satisficing rule. If Charles is right, then Bayes is a VERY bad idealization.

      In fact, as I noted in the post, the idealization has exactly the same problems as the child-as-little-linguist thesis from a while back. Just as that was a bad idealization of the language acquisition problem and should not be adopted, so too Bayes is and should not be adopted. To argue as you did also suggests that it is a bad idealization, but not because it is wrong, but because it is contentless. Do you really want to go there?

    2. Yes, I think you're right that "Bayes" isn't really a theory, but rather a set of modeling practices. This recent very sensible paper makes the case for a more practical perspective on Bayesian modeling, which doesn't make any optimality claims. It's a long paper but the first few pages are enough to get the general idea.

      I suspect most Bayesians have taken this practical position implicitly anyway - the manuscript quotes the following from a paper by a formidable cabal of Bayesians: "the hypothesis that people are optimal is not something that even the most fervent Bayesian believes" (Griffiths et al., 2012, p. 421).

    3. Elaborating on Tal's original point slightly, Bayes is a "simple and reasonable way to implement" [two obvious ideas], key word "WAY" - it is a PARTICULAR way and it makes some particular assumptions, and some particular empirical predictions. The point is I suppose that you can set the empirical claim aside (as the Tauber et al paper suggests) and just use the thing as a useful statistical tool (which is I think in practice what people are doing anyway). That doesn't mean it's vacuous though, it just means it's flexible enough that you can often ignore the empirical content.

      There are certain points where optimality makes the wrong prediction on the face of it (like probability matching as you/Charles note), but, although I'm kind of behind on the literature, I think the jury's still out on most other predictions of Bayesian inference because they're super hard. There's a lot of indeterminacy because the true prior is a hidden variable but I don't think we've given the question all we've got by any means. But you don't give up on the interesting Minimalist project of explaining humans' deviation from optimal inference just because that problem is hard.

    4. I suspect there's quite a bit of indeterminacy, because you can come up with hypothesis spaces, priors and likelihood functions that would make a lot of human behaviors seem optimal.

      E.g., a lot of the Bayesian work on inductive generalization assumed that when a particular data point supports both a narrow and a broad hypothesis, the rational thing to do is to take that point to be supporting the narrow hypothesis more than the broad one, perhaps even proportionally to the ratio of class sizes (the size principle). But that's just one possible likelihood function. You could easily think of many different ways to do this, e.g. consider that data point to be uninformative. And it turns out that you can model different people's generalization behavior by interpolating differently between the two extremes I just mentioned (Navarro et al 2012).
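The contrast between the size principle and an uninformative likelihood can be sketched in Python. This is a rough illustration under stated assumptions: two nested hypotheses, all observed points consistent with both, and a mixing parameter `theta` that interpolates between the two extremes. The mixture form, the `domain_size` parameter and the function name are my assumptions for illustration, not necessarily the model used in Navarro et al 2012.

```python
def posterior_narrow(size_narrow: int, size_broad: int, n_points: int,
                     theta: float = 1.0, domain_size: int = 100,
                     prior_narrow: float = 0.5) -> float:
    """Posterior of the narrow hypothesis after n_points consistent examples.

    theta = 1.0: size principle, each point has likelihood 1 / |h|.
    theta = 0.0: each point has the same likelihood under both hypotheses,
                 so the data are uninformative and the prior is returned.
    """
    def likelihood(size: int) -> float:
        # Mix the size-based term with a hypothesis-independent term.
        per_point = theta / size + (1.0 - theta) / domain_size
        return per_point ** n_points

    narrow = prior_narrow * likelihood(size_narrow)
    broad = (1.0 - prior_narrow) * likelihood(size_broad)
    return narrow / (narrow + broad)


if __name__ == "__main__":
    # Narrow class of 4 items, broad class of 16, three consistent examples.
    for theta in (0.0, 0.5, 1.0):
        print(theta, round(posterior_narrow(4, 16, 3, theta=theta), 4))
```

With `theta = 1.0` three consistent examples push the posterior for the narrow hypothesis to 64/65; with `theta = 0.0` it stays at the prior of 0.5. Intermediate values give intermediate degrees of preference for the narrow hypothesis, which is the kind of leeway the comment describes.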

      That's just one example. You can probably find others in the critics' papers. In general there seems to be enormous leeway for creativity in choosing the components of the model, and I don't have a good sense of the limits of that creativity. But choosing the components of the models is exactly what linguistics / cognitive science is about, so I'm not sure why the anti-Bayesian crowd seems to believe it's such a bad thing.

    5. I have a new post coming out today that tries to take this reply on. The gist is that this is a big step down from the earlier advertisements for Bayes. It actually concedes that the critics were right all along. It's "just" a modeling technique. Anything to recommend it as such? Well, it appears that the answer is no. Mostly sound and fury.

      Let me add that it is impossible to argue scientifically against a notation. It is only possible to argue against claims. One can argue against a program by arguing that its leading ideas (idealizations) are not fecund. One can argue against a theory by showing its predictions are wrong. One cannot argue against the decision to do linguistics in French rather than in German. On the interpretation you and Tal are advancing, Bayes is not even a matter of taste for it has no serious content. Fine with me if we all agree on this point.

  3. @Norbert: "And everyone has always believed that stats matter."
    If by stats you mean the number of times that things happen, then I don't think this is true. For example, the Aspects model is not sensitive to the stats, nor are, for example, Jeff Heinz's various phonological learners.

    Replies
    1. The Aspects model did not mention stats but was perfectly happy to incorporate them. It abstracted away from the problem, it was not hostile to it. The proposed idealization was not inimical to a stats addition.

      As for Heinz, I am sure that he thinks that stats matter in some areas of cognition. Stats don't matter everywhere. But all agree that they matter somewhere. Is this the Bayes claim? That it matters everywhere? That no problem however posed can be advanced without stats? Does anyone think this? Nope. It's the weak claim: stats matter somewhere. Well duh!

    2. I disagree: the Aspects idealization is incompatible with the use of statistics of the input data. You can of course change it so that it uses statistics.

      My point here is just to argue against the claim that the Bayesian claims are vacuous, by giving some examples of models of language acquisition (for that is the topic) which are indubitably non-Bayesian.

    3. Ok, we disagree. The idealization is not incompatible with stats. In fact, the utility of stats was something long recognized going back to LSLT. But this is a secondary issue. Is it your view that the advance that Bayes makes is that it advocates the use of stats? That's its intellectual contribution? That's the revolution? So before that nobody thought of using stats in cognition? Or, using stats was controversial? Is that your point? Because if it is not, then it seems to me largely beside the point. Or, to be more generous: if that's the revolution then it's been won long ago and we can move onto the interesting questions.
