Sunday, February 22, 2015

Bayes and Marr and conceptual motivation

There is a Bayes buzz in the psycho/cog world. Several have been arguing that Bayes provides the proper “framework” for understanding psychological/cognitive phenomena. There are several ways of understanding this claim. The more modest one focuses on useful tools leading to specific analyses that enjoy some degree of local justification (viz. this praises the virtues of individual analyses based on Bayes assumptions in the usual way (i.e. good data coverage, nice insights)). There is also a less modest view. Here Bayes enjoys a kind of global privilege (call this ‘Global Bayes’ (G-Bayes)). On this view, Bayesian models are epistemically privileged in that they provide the best starting point for any psychological/cognitive model. This assumption is often tied together with Marrian conceptions of how one ought to break a psycho/cog problem up into several partially related (yet independent) levels. It’s this second vision of Bayes that is the topic of this post. A version is articulated by Griffiths et al. here. Many contest this vision. The dissidents’ arguments are the focus of what follows. Again, let me apologize for the length of the post. A lot of the following is thinking out loud and, sadly, I tend to ramble when I do this. So if this sort of thing is not to your liking, feel free to dip in and out or just ignore. Let’s begin.

Cosma Shalizi (here) blogs about a paper by Eberhardt and Danks (E&D) (here) that relates to G-Bayes. Shalizi’s blog post discusses E&D’s paper, which focuses on the problem that probability matching poses for Bayesian psycho theories. The problem E&D identify is that most of the empirical literature (which E&D survey) ends with the subject pool “probability matching the posterior.”[1] Here’s the abstract:

Bayesian models of human learning are becoming increasingly popular in cognitive science. We argue that their purported confirmation largely relies on a methodology that depends on premises that are inconsistent with the claim that people are Bayesian about learning and inference. Bayesian models in cognitive science derive their appeal from their normative claim that the modeled inference is in some sense rational. Standard accounts of the rationality of Bayesian inference imply predictions that an agent selects the option that maximizes the posterior expected utility. Experimental confirmation of the models, however, has been claimed because of groups of agents that ‘‘probability match’’ the posterior. Probability matching only constitutes support for the Bayesian claim if additional unobvious and untested (but testable) assumptions are invoked. The alternative strategy of weakening the underlying notion of rationality no longer distinguishes the Bayesian model uniquely. A new account of rationality—either for inference or for decision-making—is required to successfully confirm Bayesian models in cognitive science.

There are replies to this (here) by Tenenbaum & friends (T&F) and a reply to the reply (here) by Icard, who I think is (or was) a student of Danks’ at CMU.  The whole discussion is very interesting, and, I believe, important.

Here’s how I’ve been thinking about this translated into linguistiky terms. Before getting started, let me say up front that I may have misunderstood the relevant issues. However, I hope that others will clarify the issues (viz. dispel the confusion) in the comments. So, once again, caveat lector!

Griffiths et al. describe Bayesian models as follows. They are intended as “explanations of human behavior.” The explanations are “teleological” in that “the [Bayesian-NH] solutions are optimal” and this optimality “licenses an explanation of cognition in terms of function.” The logic is the following: “the match between the solution and human behavior may be why people act the way they do.” In other words, a Bayesian model intends to provide an optimal solution to “problems posed by the environment” with the assumption that if humans fit the model (i.e. act optimally) then the reason they do so is because this is the optimal solution to the problem. Optimality, in other words, is its own explanation. And it is the link between optimality and explanation that lends Bayes models a global kind of epistemic edge, at least for Marr level-1 descriptions of the computational problems that psycho theories aim to explain.

Curiously, Bayesians do not seem to believe that people actually do act/cognize optimally.[2] As T&F explain, Bayesian accounts do “not imply a belief that people are actually computing these optimal solutions” (p. 415).[3] Why not? Because it is recognized that such computations are intractable. Therefore, assuming that the models are models of what people do is “not a viable hypothesis.” How then to get to the actual psychological data points? To get from the optimal Bayesian model to the witnessed behavior requires the addition of “approximate algorithms that can find decent solutions to these problems in reasonable time” (p. 415-6). This seems to suggest that subjects’ behavior is approximately optimal.

In other words, optimal Bayes considerations set the problems to be solved by more tractable algorithms, which in fact do a pretty good job of coming close-ish to the best solution given resource constraints. So, what faces the tribunal of actual psychological evidence is the combination of (i) a Bayesian optimal solution and (ii) a good enough algorithm that approximates this optimal solution given other computational constraints. So, more specifically, in cases where we find individuals probability matching rather than all converging on the same solution (as the Bayes model would predict) the algorithm does the heavy lifting. Why? Because a Bayes account all by itself implies that all participants, if acting in a rational/optimal Bayes manner, should make the same choice: the one with the highest posterior. So Bayes accounts need help in order to make contact with the data in such cases, for Bayes by itself is inconsistent with probability matching results.[4] Here’s Shalizi making this point:

Here's the problem: in these experiments (at least the published ones...), there is a decent match between the distribution of choices made by the population, and the posterior distribution implied by plugging the experimenters' choices of prior distribution, likelihood, and data into Bayes's rule. This is however not what Bayesian decision theory predicts. After all, the optimal action should be a function of the posterior distribution (what a subject believes about the world) and the utility function (the subjects' preferences over various sorts of error or correctness). Having carefully ensured that the posterior distributions will be the same across the population, and having also (as Eberhardt and Danks say) made the utility function homogeneous across the population, Bayesian decision theory quite straightforwardly predicts that everyone should make the same choice, because the action with the highest (posterior) expected utility will be the same for everyone. Picking actions with frequencies proportional to the posterior probability is simply irrational by Bayesian lights ("incoherent"). It is all very well and good to say that each subject contains multitudes, but the experimenters have contrived it that each subject should contain the same multitude, and so should acclaim the same choice. Taking the distribution of choices across individuals to confirm the Bayesian model of a distribution within individuals then amounts to a fallacy of composition. It's as though the poet saw two of his three blackbirds fly east and one west, and concluded that each of them "was of three minds", two of said minds agreeing that it was best to go east.
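
To make the contrast concrete, here is a minimal sketch in Python of the two behaviors being distinguished. Everything in it is an illustrative assumption rather than anything taken from E&D or Shalizi: a toy posterior of 0.7/0.3 over two options, a simple 0-1 utility (a correct choice pays 1, an incorrect one 0), and hypothetical function names. It only shows that, when every subject holds the same posterior and utility, maximizing expected utility has everyone choose the same option, whereas probability matching spreads choices in proportion to the posterior and does worse in expectation.

```python
import random
from collections import Counter


def maximize(posterior):
    """Decision rule under the assumed 0-1 utility: choose the option
    with the highest posterior probability."""
    return max(posterior, key=posterior.get)


def probability_match(posterior, rng):
    """Choose an option at random, with probability equal to its posterior."""
    options = list(posterior)
    weights = [posterior[o] for o in options]
    return rng.choices(options, weights=weights, k=1)[0]


def expected_accuracy_maximizing(posterior):
    """Probability of being correct when always picking the top option."""
    return max(posterior.values())


def expected_accuracy_matching(posterior):
    """Probability of being correct when sampling the choice from the
    posterior: sum over options of P(option is correct) * P(option is chosen)."""
    return sum(p * p for p in posterior.values())


# Assumed toy posterior, shared by every simulated subject.
posterior = {"A": 0.7, "B": 0.3}
rng = random.Random(0)  # fixed seed so the simulation is repeatable
n_subjects = 10_000

max_choices = Counter(maximize(posterior) for _ in range(n_subjects))
match_choices = Counter(probability_match(posterior, rng) for _ in range(n_subjects))

print("maximizers:", dict(max_choices))
print("matchers:  ", dict(match_choices))
print("expected accuracy, maximizing:", expected_accuracy_maximizing(posterior))
print("expected accuracy, matching:  ", expected_accuracy_matching(posterior))
```

Under these assumptions the maximizers all pick A and are right 70% of the time, while the matchers split roughly 70/30 and are right only 0.7² + 0.3² = 58% of the time. That gap is the sense in which matching the posterior is not what the decision theory itself recommends.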

Now for a linguistics analogue: Acceptability data is what we largely use to adjudicate our competence theories.[5] Now, we recognize that acceptability and grammaticality do not perfectly overlap, there being some cases of grammatical sentences that are unacceptable and some cases of acceptable sentences being ungrammatical. However, this mismatch is marginal (at least in a pretty big domain).[6] What E&D argue (and Shalizi emphasizes) is that this is false in the typical case of Bayesian psychological accounts. Typically the Bayes solution does not provide the right answer, for in many cases we find that the tested population probability matches the posterior distribution of the Bayes model. And this is strongly inconsistent with what Bayes would predict. Consequently, what allows for the fit with the experimental data is not the Bayes vision of the computational problem but the added algorithm. In effect, it would be as if our competence theories mostly failed to fit our acceptability profiles and we explained this mismatch by adding parsing theories that took up the slack. So, for example, it would be as if our Gs allow free violations of binding principle A but our parsers say that such violations, though perfectly grammatical, are hard to parse and this is why they sound bad.

Let me be clear here: syntacticians do in fact make such arguments. Think of what we say about the unacceptability of self-embedded sentences (e.g. ‘That that that Bill left annoyed Harry impressed Mary’). However, imagine if this were the case in general. I think that syntacticians would begin to wonder what work the competence theory was doing. In fact, I would bet that such a competence theory would quickly find its way to the nearest waste basket. This is effectively the point that Icard makes in his reply to T&F. Here’s the money quote:

“It is commonly assumed that a computational level analysis constrains the algorithmic level analysis. This is not always reasonable, however. Sometimes, once computational costs are properly taken into account, the optimal algorithm looks nothing like the ideal model or any straightforward approximation thereto” (p. 3) [my emphasis, NH].

This raises a question, which Shalizi (and many others) presses home, about the whole rationale behind Bayesian modeling in psychology. Recall, the explanatory fulcrum is that Bayes models provide optimal solutions, as it is this optimality that licenses teleological explanations of observed behavior. Shalizi argues that the E&D results challenge this optimality claim.

By hypothesis, then, the mind is going to great lengths to maintain and update a posterior distribution, but then doesn't use it in any sensible way. This hardly seems sensible, let alone rational or adaptive. Something has to give. One possibility, of course, is that this sort of cognition is not "Bayesian" in any strong or interesting sense, and this is certainly the view I'm most sympathetic to…

In other words, Shalizi and Icard are asking in what sense the Bayes level-1 theory of the computation is worth doing given that its description of the problem seems not to constrain the algorithmic level-2 account (i.e. Marr’s envisioned fertile link between levels seems to be systematically broken here). If this is correct, then it raises Icard’s question in a sharp form: in what sense is providing a Bayesian analysis even a step in the direction of explaining the psychological data? Or to put this another way: what good is the Bayesian computational level theory if E&D and Shalizi and Icard are right? Things would not be bad if most of the time subjects approximated the Bayes solution. What’s bad is that this appears to be the exception rather than the rule in the experimental literature that is used to argue for Bayes. Or so E&D’s literature review suggests is the case.

To fix ideas, consider the following abstract state of affairs: Subjects consistently miss target A and hit target B. One theory is that they are really aiming for A, but are missing it and consistently hitting B because trying to hit A is too hard for them. Moreover, it is too hard in precisely a way that leads to consistently hitting B. This account might be correct. However, it does not take a lot of imagination to come up with an alternative: the subjects are not in fact aiming for A at all.  This is the logic displayed by the literature that E&D discusses. It is not hard to see, IMO, why some might consider this less than powerful evidence in favor of Bayes accounts.

Let me further embroider this point as it is the crux of the criticism. The Bayes discussions are often cloaked in Marrish pieties. The claim is that Bayes analyses are intended to specify the “problem that people are solving” rather than provide a “characterization of the mechanisms by which they might be solving it” (p. 2 Griffiths et al.). That is, Bayes proposals are level-1, not level-2 theories. However, what makes Marrish pieties compelling is precisely the intimation that specifications of the problems to be solved will provide a hint about the actual computations and representations that the brain/mind uses to solve these problems. Thus, it is fruitful to indulge in level-1 theorizing because it sheds light on level-2 processes. In fact, one might say that Marr’s dubbing level-1 theories as ‘computational’ invites just this supposition, as does his practice in his book. Problems don’t compute, minds/brains do. However, a computational specification of a problem is useful to the degree that it suggests the magnitudes the brain is computing and the representations and operations that it uses to compute them. The critique above amounts to saying that if Bayes does not commit itself to the position that by and large cognition is (at least approximately) optimal, then it breaks this Marrian link between level-1 and level-2 theory and thereby loses its main conceptual Marrian motivation. Why do a Bayes level-1 analysis if it in no way suggests the relevant brain mechanisms or the variables mental/brain computations juggle or the representations used to encode them? Why not simply study the representations and algorithms directly without first detouring via a specification of a problem that will not shed much light on either?

Here’s one more try at the same point: Griffiths et al. try to defend the Bayes “framework” by arguing that it is a fecund source of interesting hypotheses. That’s what makes Bayes stories good places to start. But this just seems to beg the question critics are asking; namely, why should we believe that the framework does indeed provide good places to start? This is the point that Bowers and Davis (B&D) seem to be making in their reply to Griffiths et al. here and the one that Shalizi, E&D and Icard are making as well.

Let me provide an analogy with contemporary minimalism. In proposing a program it behooves boosters to provide reasons for why the program is promising (i.e. why adopting the program’s perspective is a good idea). One, of course, hopes that the program will be fecund and generate interesting questions and analyses (i.e. models), but although the proof of a program’s pudding is ultimately in the eating, a program’s proponents are required to provide (non-dispositive) reasons/motivations for adopting the program’s take on things. In the context of MP this is the role of Darwin’s Problem. It provides a rationale for why MP-generated questions are worth pursuing that is independent of the fruits the program generates. Darwin’s Problem motivates the claim that the questions MP asks are interesting and worth pursuing because if answered they promise to shed light on the basic architecture of FL. What is the Bayes analogue of Darwin’s Problem? It seems to be the belief that considering the properties of optimal solutions to “problems posed by the environment” (Griffiths et al. p. 1) will explain why the cognitive mechanisms we have are the way they are (i.e. a Bayes perspective will provide answers to a host of why questions concerning the representations and algorithms used by the mind/brain when it cognizes). But why believe this if we don’t also assume that a specification of the computational level-1 problems will serve to specify level-2 theories of the mechanisms? All the critiques in various ways try to expose the following tension: if Bayes optimality is not intended to imply anything about the mind/brain mechanisms then why bother with it, and if it is so intended then the evidence suggests that it is not a fruitful way to proceed, for more often than not it empirically points in the wrong direction. That’s the critique.

Please take the above with a large grain of salt. I really am no expert in these matters so this is an attempted reconstruction of the arguments as a non-expert understands them. But, I believe that I got the main points right and if things are as Shalizi, E&D, Icard and B&D describe them to be then it stands as a critique of the assumption that Bayes based accounts are prima facie reasonable places to start if one wants to account for some cognitive phenomenon (i.e. it seems like a strong critique of G-Bayes). Or to put this in more Marrian terms: it is not clear that we should default to the position that Bayes theories are useful or fecund Level-1 computational analyses as they do not appear to substantially constrain the Level-2 theories that do all (much) of the empirical heavy lifting. This is the critique. It suggests, as Shalizi puts it, that acting like a Bayesian is irrational. And if this is correct, it seems important, for it challenges the main normative point that Bayes types argue is the primary virtue of their way of proceeding.

Let me end with two pleas. First, the above, even if entirely correct, does not argue against the adequacy of specific Bayesian proposals (i.e. against local Bayes). It merely argues that there is nothing privileged about Bayesian analyses (i.e. G-Bayes is wrong) and there is nothing particularly compelling about the Bayes framework as such. It is often suggested (and sometimes explicitly stated) that Rational Analyses of cognitive function (and Bayes is a species of these) enjoy some kind of epistemologically privileged status due to their normative underpinnings (i.e. the teleological inferences provided by optimality). This is what these critical papers undercut if successful. None of this argues against any specific Bayes story. On its own terms any such particular account may be the best story of some specific phenomenon/capacity etc. However, if the above is correct, a model gains no epistemic privilege in virtue of being cast in Bayesian terms. That’s the point that Shalizi, E&D, Icard and B&D make. Specific cases need to be argued on their merits, and their merits alone.

Second, I welcome any clarifications of these points by those of you out there with a better understanding of the current literature and technology. Given the current fashion of Bayes story telling (it is clearly the flavor of the month) in various parts of cognition (including linguistics) it is worth getting these matters sorted out. It would be nice to know if Bayes is just technology (and if so justified one application at a time) or whether it is a framework which comes with some independent conceptual motivation. I for one would love to know.





[1] Griffiths et al. agree that there is a lot of this in the literature and agree it is a serious problem. They offer a solution based on Tenenbaum & friends that is discussed below.
[2] Optimality of cognitive faculties is a pretty fancy assumption. I know because Minimalists sometimes invoke similar sentiments and it has never been clear to me how FL or cognition more generally could attain this perfection. As Bob Berwick never tires of telling me, this is a very fancy assumption in an evolutionary context. Thus, if we are really perfect, this may be more of a problem requiring further explanation than an explanation itself.

Griffiths et al. seem to agree with this. They tend to agree that humans are not optimal cognizers nor even approximately optimal cognizers. Nonetheless, they defend the assumption that we should look for optimal Bayes solutions to problems as a good way to generate hypotheses. Why exactly is unclear to me. The idea seems to be the tie in between optimal solutions and the teleological explanations they can offer to why questions. Nice as this may sound, however, I am very skeptical for the simple reason that teleological explanations are to explanations what Christian (and political?) Science is to science. Indeed, these were precisely the kinds of explanations that 16th century thinkers tried so hard to excise. It is hard to see how this functional fit, even were it to exist, is supposed to explain why a certain system has the properties it has absent non-teleological mechanisms that get one to these optimal endpoints.

[3] Note that denying that people compute optimal solutions is consistent with the claim that people act optimally. I mention this for the teleological account explains just in case the subjects are acting according to the model. Only then can “the match between the solution and human behavior” teleologically explain “why people act the way they do.” If, however, subjects don’t act this way, then it is hard to see how the teleological explanation, such as it is, can gain a footing. I confess that I don’t follow the Griffiths et al. reasoning here. The only way out I can see is if they assume that subjects do in fact act (approximately) optimally even if they don’t compute using Bayes-like algorithms. But even this seems incorrect, as we shall see below.
[4] It is important to note that it is the posterior probability (a number that is the product of prior Bayesian calculation) that is being matched, as Griffiths et al. observe.
[5] Actually, acceptability under an interpretation.  Acceptability simpliciter is just the limiting case where some bit of linguistic data is not acceptable under any interpretation.
[6] Why the caveat? If Gs generate an unbounded number of Ss then most will be beyond the capacity for judging acceptability. What we assume is that where an acceptability judgment can be made grammaticality and acceptability will largely overlap.

29 comments:

  1. You might find useful Griffiths, Lieder, and Goodman's forthcoming 'Rational Use of Cognitive Resources: Levels of analysis between the computational and algorithmic,' available on Griffiths' lab's site. It argues for a descendant of Anderson's view, according to which you start with a conception of the problem and try to model the optimal solution making minimal assumptions about computational limitations (a la Marr's computational level); but then when there's a lack of fit with the behavioral data, you go back and refine: for example, you adjust your conception of the problem, or (more relevant here) you now make less minimal assumptions about computational limitations. For example, building in the cost of computation yields a level of analysis intermediate between the computational and the algorithmic. Their view is thus similar to Icard's bounded rationality approach. At least that's the basic idea, as I understand it. (Incidentally, Icard was a Stanford PhD, advised by van Benthem, and that paper stems from his dissertation; but he was briefly a post-doc at CMU, where Danks is.)

    1. Interesting. I guess the question still stands: what does doing a Level-1 analysis buy you? If indeed there are few cases of (near) optimal behavior then what does modeling things as such get one? It's always possible to add supplementary assumptions that get you to the data. This is Glymour's point about Ptolemaic Psychology. So for example, let's say people did not probability match the posterior, what would Tenenbaum and friends say then? Or say the match was something else entirely. Couldn't one always add assumptions to get you to the data regardless what the Level-1 theory said? If the answer is yes, then what use is the framework? What work does it do? We can always do this; the question is why bother.

      Thx for the reference. N

    2. Why bother is indeed the right question: surely everything can be stated in terms of Turing machines, information theory, quantum mechanics (recall Putnam's round peg and square hole), etc.; the question is what insight may be gained by doing so.

      There are specific instances in the current work on language. An influential approach to word learning (Xu & Tenenbaum) is to produce the best Bayesian lexicon over the set of all possible lexicons given the set of "sounds" and "meanings". This is clearly computationally intractable (see Beal and Roberts 2009: http://web.mit.edu/jakebeal/www/Publications/ComplexityCogSci2009.pdf). So approximation methods must be found. A particular approximation algorithm is the so-called Win-stay-lose-switch (http://www.alisongopnik.com/papers_alison/WSLSManuscriptRRFinalInPressLB.pdf), a heuristic strategy from the psychology of learning. But that appears to be the ACTUAL process of word learning, dubbed Propose-but-Verify, as the Trueswell-Gleitman group has shown and modified as Pursuit (http://facultyoflanguage.blogspot.com/2013/03/learning-fast-and-slow-i-how-children.html).

    3. I think their answer to *this* "why bother?" question is that it's supposed to provide an answer to the question: *why* do we have such a mechanism? The answer is supposed to be: we have such a mechanism because it provides a (near) optimal solution to the problem posed at the 'computational + resource constraints' level ... and evolution tends towards optimality ... (er....) --Of course this leads to a different objection. For now we just have an evolutionary (or ontogenetic?) just-so story unless evidence can be provided that indeed the mechanism was selected for. (By the way, my comment below, listed as posted at 6:25 (in some time zone), was meant as a reply to Norbert's 2/23 post. My apologies for screwing up how posting to these threads works!)

    4. Just my two cents, but I find the "optimality" claims not the most useful way to think about things, nor the Marr 3-level approach.
      Instead, I prefer the "methodological" view that performing Bayesian Inference is a way of getting directly at the "logical" questions about language acquisition, i.e. the question(s) of what, as a matter of principle, can be inferred by a learner given a precise specification of their inductive biases and the available inputs.

      Why use Bayesian modeling for this? Because it's the best way I am aware of of factoring out the logical from any algorithmic questions; performing Bayesian inference is the straightforward extension of deduction in the context of uncertainty.

      So why would you want to do this? Well, drawing out the conclusions of specific proposals is part of evaluating them. One way or another, you have to do that anyways. So might as well do it in a principled way.

    5. If saying that this is the "best way" is another way of saying that it is what an ideal agent, or one conforming to the norms of rationality, would do, then isn't this just another way of saying you are operating at Marr's computational level, without worrying about algorithmic questions and about resource limitations etc.?

    6. I don't see how. There's no Marr-levels for the question whether (and to what degree) something can be inferred from something else given particular assumptions.
      I suspect it's just me, and most people don't really care about the "logical question of language acquisition". Which is fair enough, really; but for me, its providing a principled tool for tackling this question is what made the Bayesian framework attractive.

    7. So, where would Chomsky's original evaluation metric fit wrt Bayes and Marr? I see the kind of Bayesian view that I perceive Ben as advocating as a necessary upgrade of Chomsky's idea to accommodate the fact that the available negative evidence is (almost?) entirely indirect. COEM is like Marr level 1 in that it is not algorithmic, unlike Marr level 1 in that it is not a solution to a sensibly posed empirical problem (we don't know what the best grammar notation is, nor even that there is such a thing, really, and even less do we know that the fully correct solution to the grammar selection problem can be posed as picking the 'shortest' grammar that appropriately matches the evidence). And it's not intelligible 'perceptual Bayes' (bats detecting moths and vice versa) for a similar reason: there are no environmentally valid statistics of what the most probable grammar is that it would be plausible to put into the LAD (ie the LAD shouldn't be allowed to know that its local environment has been occupied for the last several thousand years by speakers of agglutinating languages with freeish word order and lots of case-marking, whereas natural selection presumably does allow perception of bats by moths to eventually adapt to the arrival of a new species of bat with different audible characteristics).

      In spite of all this, I think something like the Bayesian/MDL upgrade of COEM is the only way to go with the apparent demise of P&P as a viable idea (perhaps we can take the peculiar restrictions on accusative case-marking in Russian in Pesetsky 2010 as the final nail in the coffin?)

    9. with the apparent demise of P&P as a viable idea (perhaps we can take the peculiar restrictions on accusative case-marking in Russian in Pesetsky 2010 as the final nail in the coffin?)

      Avery, what are you talking about? What restrictions on accusative case? What demise? Is "Pesetsky 2010" the draft version of my 2014 book? (The section on accusative is one of the most revised sections of the published book, in case that matters.)

      Puzzled.

    10. Oops yes except that the electronic copy in our library says 2013 (Russian Case Morphology and the Syntactic Categories); the specific 'rule' is (77) on p. 66 'Russian-specific restriction on assignment of VACC', which seems to me to be too complicated to be a 'parameter'. Of course the distinction between 'parameters' and 'rules' was never spelled out in a clear way, and 'Principles and Rules' is not necessarily a crazy architectural aspiration ... I think Joan Bresnan was trying to push LFG in that direction before she lost interest in conventional syntactic theory.

  2. One might distinguish 2 questions here. First, why start with level 1 (the computational level)? Why not start with level "1.5", where one builds in various resource limitations? (Or should that be level 3 and level 2.5? I forget which direction Marr counts.) The answer might be because starting with level 1 as a baseline, combined with looking at discrepancies with the behavioral data, gives one insight as to how to proceed at level 1.5. Second, can't one *always* fit the data (your main question)? Sure, but that's an issue generally, not just with this framework. If it all seems ad hoc and unmotivated, with assumptions added merely to save the data, with no surprising predictions generated and then verified, no fruitful new questions posed, no new research directions opened up, no generalizing to new unexpected cases, with no independent motivation or convergence of results from multiple sources, then yes indeed consign it to the flames. But if not .... So, let's see. --That there's a cost to thinking (time costs, metabolic costs, etc.) doesn't seem ad hoc.

    1. I think that I expect more out of Marr than perhaps you do. I took the point of the levels to be that stating a problem at level-1 was going to give insight (hints) as to how to proceed at level-2. Thus seeing the problem as an arithmetic one meant that you were looking for the analogue of numbers (though whether base 2 or 10 or 30 was undetermined) and arithmetical operations (adders) in the basic machinery. This is why it was useful to see the level-1 problem in terms of arithmetic. Similarly with distal geometry. The primitives you were using to compute the distal geometry were ones the problem hinted at (e.g. look for lines, angles, edges etc). This is what is missing from so many Bayes stories. They do not really advance the level-2 investigations. See Charles's comment above for example. It appears that getting to the "right" answer was in no way facilitated by considering the level-1 problem as they posed it. What's this mean? That the level-1 speculations are idle. Go ahead and do this if you want, but don't expect much payoff. This is hardly a rousing recommendation, at least not to me. Were this the best that Marr could do, nobody would have listened.

      The critique as I understand it asks the Bayesians to provide real conceptual motivation for starting with Bayes analyses. They attempt to do this by wrapping the framework up in Marrian pieties (Marrities?). The problem is that, if the critics are right, they often fail to do the legwork Marr did; namely, showing how the level-1 problem helps. In fact, one might go further (Shalizi clearly did): starting with Bayes points you in the WRONG direction. One can recover of course by getting a clever algorithm to backtrack to the right empirical results, but this doesn't obscure the fact (actually it does, but it shouldn't) that you went way out of your way to get to the answer. Hence the question: Why bother? Is it all just that Bayes enjoys a little current fame and prestige? Is this just another case of emperors and wardrobes? If so, then there is nothing wrong with being fashionable, but nothing right either, and it does tend to obscure the basic issues.

    2. This comment has been removed by the author.

    3. Right, one can certainly get the right level-2 solution without having first considered the level-1 problem. When Bowers and Davis point this out with examples, Griffiths et al. reply (if memory serves) that having all the level-2 answers would still leave lots of why-questions hanging. (But see the worry I note above, in replying to Charles Yang, about the Bayesians' proposed way of trying to answer them.) In any event, acknowledging that one can get level-2 answers without having to go via level-1 leaves open the possibility that going via level-1 might be quite fruitful (in addition to connecting up one's level-2 proposal with an alleged answer to the why-question). But the proof, if there is one, of the pudding will be in the eating. I guess the work on approximative algorithms in computer science that they're taking off the shelf wouldn't have been done but for their looking for computationally tractable ways of moving from level-1 to level-2 with their machines. --Well, it's early days for the Bayesian cogsci modelers' attempts to move down (up?) from level 1. Perhaps let a hundred flowers try to bloom? (My outsider's sense is that things are a bit different with Bayesian perception, where there already seems to be a real sense of progress enabled by Bayesian methods.)

    4. I think you mistake the critique. It is not that people should NOT do Bayes style research. People should (and do) do anything they hope might work. The question is whether there is something independently attractive about this "framework." And this, I believe, requires some kind of motivation suggesting why one ought to start from Bayes-like positions. The leaders of the Bayes movement have in fact tried to provide such motivation. It consists of two parts: Bayes+Marr. Moreover, it seems clear to me that they have managed to do quite well with the selling of this package. The question is whether the framework deserves this popularity or whether, like many things, it is popular because it is popular. From what I can tell, the arguments given are not particularly compelling. There are independent successes and this is great. But there are independent successes using other assumptions as well. I have been at meetings where people have asked the non-Bayesians to, in effect, justify their work in Bayes terms. I have not generally seen the converse. So, there is a perception out there that this is a good place to begin, indeed, the best place to start cog investigations. And this seems, from what I can tell, unsupported by any decent argument.

      So, sure, let a hundred (even a thousand) flowers bloom. But if this is what you wish then the a priori assumption that all good work starts with Bayes needs a little shaking up. Either we need a better defense of the claim, or we should just understand that this is one place among many to begin, with no special privilege. Recall, the argument was against G-Bayes, not local Bayes analyses.

      Thx for pushing back however. It helped me make my thoughts clearer, at least to myself.

    5. Thanks Norbert for the further remarks. (And I'm just trying to figure out what I think of their work too.) I guess I had in mind (however unclearly) two kinds of reply they might give to what was bothering you. One was, if it's fruitful locally, it might be worth trying as a method more generally. That's not Marrian. (And it wouldn't justify asking others to justify their work in Bayesian terms!) The second was optimal, adaptationist quasi-Marrianism: Evolution (or perhaps other processes, e.g. ontogenetic) tends to optimize; so look at what would be optimal as a baseline; then, when there are behavioral discrepancies, look at what would be optimal given resource constraints. The optimal adaptationist assumption would provide a general "conceptual" reason for proceeding this way. (And it would answer why we have the mechanisms and processes we do.) It's also of course highly contentious. Maybe less so for, say, early vision? --(I think this is their view, not something I'm foisting upon them. Don't Griffiths et al. have a bit on teleological explanation, cashed out evolutionarily, in their reply to Bowers and Davis?)

    6. Maybe. From what I understand, evolution does not tend to optimize, though it may occasionally. But that could be a motivation. I guess I think (and I am not sure you disagree) that this is pretty weak spirits, not enough to get anyone reasonable drunk with the mindset unless already predisposed. At least to me, the best argument was the first: we've had success, so let's look for more. The problem is that, as many have observed, the successes really are in the eyes of the beholders and not all agree about this. Marcus has a pretty devastating critique of many of the results, suggesting that there is a lot less there than meets the eye. At any rate, I've benefitted from your push back, so thx. I think, however, I may have hit the point where the returns from further thinking may be approaching 0. Thx again.

    7. Right, I don't disagree: evolution doesn't optimize, so it's quite an assumption -- but it does seem to be theirs. By the way, Goodman et al. have a reply to Marcus forthcoming in Psych Science (available on his website). --Thanks to you too -- very useful for me (as were also the links in your original post).

    8. @Steve Gross: A longer reply re evolutionary pan-adaptationism and its discontents (probably a blog post by me) will have to wait. But for now, it might be useful to point out that even the usual 'satisficing' redoubt doesn't really cut much ice -- viz., Sean Rice's comment in his (wonderful) "Evolutionary Theory: Mathematical & Conceptual Foundations" (Sinauer), on p. 37: "Though evolutionary theory is not built on the idea that any quantity is necessarily maximized, the idea that there is such a quantity remains one of the most widely held popular misconceptions about evolution." IMO, absorbing what Rice manages to lay out just in this first chapter is a step along the way in moving from what some have called "vulgar Darwinism" - the kind you read about in airport bookstores or the New Yorker - to a more nuanced, mature view.

  3. @Steve Gross: I read through the reply by Griffiths that you point to at the top of this thread - and I cannot find a word in it that counters the Shalizi/Eberhardt & Danks position. I am sure I missed something, because I'm often wrong. In any case, the Bayesian retreat to "approximate solutions" does not even work - at least in any simple-minded form. Kwisthout, Wareham, & Van Rooij, 2011 in Cognitive Science show that "most—if not all—forms of approximating Bayesian inference are for unconstrained input domains as computationally intractable as computing Bayesian inference exactly." (I'm quoting from a later paper by K&VR that does offer some suggestions, as per below.) So this "easy" form of retreat - using, say, particle sampling - does not actually solve this problem. They do offer some hope in a later 2013 article via the route of what's known as "parameterized complexity theory": one can isolate the component(s) (parameters) giving rise to the computational intractability and impose problem-specific constraints on their structure. But to repeat, approximation in and of itself does not do the job. (K&VR: Bridging the Gap between Theory and Practice of Approximate Bayesian Inference; J. Cognitive Systems, 2013). Evidently, one must proceed on a case by case basis.

    Replies
    1. Thanks for this! I pointed to the Griffiths et al. as a recent statement of their conception of things, one which emphasizes bringing in resource constraints so that the picture isn't one where one's just working on Marr's computational level. In Vul et al.'s 'One and done' (which the Griffiths et al. piece cites and which is discussed by the Icard piece Norbert mentions in his original post), it's argued that the probability matching that's at the heart of Eberhardt and Danks' argument is the result of the approximate algorithm that factors in time-costs. With not implausible assumptions about time-cost, it turns out that just taking a few samples -- even just taking *one* -- is optimal (given that one's sampling--see below). So, probability matching, rather than being only a problematic departure from optimality at Marr's computational level, can be seen as what you should expect at this "intermediate" level of analysis between Marr's computational level and his algorithmic level -- i.e., at this level that factors in the cost of thinking. So, that's how it's supposed to address Eberhardt and Danks. To be sure, that doesn't yet show that the approximation is itself tractable (though, if memory serves, Vul et al.'s brief remarks about K, W, and VR seem to indicate they think there's not a problem for them; certainly taking one sample doesn't seem very demanding, but I don't think Vul et al. make any commitments about how the sample gets taken in the first place, unless I'm forgetting). And, yes indeed, it doesn't tell us why one should expect optimality (or even satisficing) relative to resource constraints. (There's no "teleological" explanation of the sort they claim their framework gives unless they can actually defend a relevant story of how we came to be this way, assuming we are this way.) Another thing (as Icard I think points out) is that Vul et al. don't actually argue that sampling *is* optimal given the time-cost. They ask rather: *if* one's sampling, how many samples would be optimal given time-costs? They don't argue that there couldn't be something that beats sampling in the first place.
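
To make the "if one is sampling, how many samples?" question concrete, here is a minimal sketch of the kind of calculation at issue. It is not Vul et al.'s own model: the two-option setup, the majority-vote rule, the reward-rate formula, and the names `p_choose_likelier`, `expected_accuracy`, `reward_rate`, and `best_k` are all assumptions chosen for illustration. With k = 1, the agent picks an option with probability equal to its posterior probability, which is exactly probability matching.

```python
from math import comb


def p_choose_likelier(p, k):
    """Probability that a majority vote over k posterior samples (k odd)
    selects the option whose posterior probability is p."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
               for i in range(k // 2 + 1, k + 1))


def expected_accuracy(p, k):
    """Chance that the chosen option is correct, assuming the option with
    posterior probability p really is correct with probability p."""
    q = p_choose_likelier(p, k)
    return q * p + (1 - q) * (1 - p)


def reward_rate(p, k, action_time, sample_time):
    """Expected correct decisions per unit time (hypothetical cost model:
    each decision costs action_time plus k * sample_time)."""
    return expected_accuracy(p, k) / (action_time + k * sample_time)


def best_k(p, action_time, sample_time, ks=range(1, 100, 2)):
    """Odd sample count in ks that maximizes the reward rate."""
    return max(ks, key=lambda k: reward_rate(p, k, action_time, sample_time))


if __name__ == "__main__":
    # Expensive samples versus nearly free samples (both values are assumptions).
    print(best_k(0.7, action_time=1.0, sample_time=1.0))
    print(best_k(0.7, action_time=1.0, sample_time=0.001))
```

Under this toy cost model a single sample wins when samples are as expensive as actions, and more samples win when they are nearly free. Note that the sketch only compares sampling strategies with one another; as the comment above stresses, it says nothing about whether some non-sampling procedure would do better.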

    2. Thanks again for the pointer to the Vul et al. paper, Steven, which was indeed enlightening (though perhaps not in the way the authors intended). I have had just barely enough time to skim through the Vul et al. paper, so a deeper and sharper analysis will have to wait. For now, though, my take is that T & friends aren’t out of the woods, not by a long shot. First of all there’s a crucial bit of moving-the-goal-posts going on that I think Steven also picked up on: the noble enterprise that began as an attempt to calculate an optimal Bayes posterior, now known to be not even approximately computable in polynomial time, has been replaced by the goal of optimal decision making, a different kettle of fish. If it’s just decision making we’re worried about, then other, computationally much cheaper means might be available - such as ordinary maximum likelihood estimation (MLE). That’s exactly what I and my co-authors found out when we started re-examining T & friends’ various attempts at modeling the language domain: in the 2013 LSA we show that MLE works just as well for figuring out alternating verb classes as the horrendously more complex Bayesian methods. Second, the “1 and done” sampling method sounds awfully, awfully, frighteningly close to what Nature never serves us: a free lunch. If it were really possible to get close-to-optimal results with just 1 or 2 samples (and yes, I mean the reading that *if* one is sampling, then...) why hasn’t this swept the field of decision theory? None of my colleagues in the Laboratory of Information and Decision Systems here at MIT have heard of such a thing. But this will have to wait for a more careful reading.

    3. Perhaps a small thing: I'm not sure if one should describe the goal-post-moving as moving from calculating the optimal posterior to optimal decision making. Eberhardt and Danks' point is that, if the decision making is optimal, then one shouldn't find the probability matching that one does find and that Bayesians then mistakenly take as evidence of optimal decision making. Vul et al. (and Griffiths et al.) reply by moving the goal post from optimal decision making to optimal decision making with resource constraints (in their worked examples, from optimal decision making where more time has no cost to one where it does--and where sampling from the posterior takes time and thus incurs a cost).

  4. Hi Steve, I guess I didn't write clearly enough, and you said what I meant much more clearly – yes, moving the goal posts to say that we're now going to consider optimal decision making under resource constraints is right. But then, as I note, there are many other methods that do as well or better - like the concrete example I pointed to. Since all of the 3 or 4 'language related' examples T & friends have published can be replaced by much more efficient MLE to arrive at the same results - this is what we have shown - so far at least I can't see any reason to prefer their Bayesian approach to, e.g., MLE. It would be great if someone could provide a concrete example that demonstrates otherwise. Beyond that, there are technical issues re sampling from posteriors that, as I mentioned, will have to wait for another day. The worry I have is that they really *do* think they don't have to show anything more than what they've done, whereas, by my lights, they are still in the woods. I'll write about the technical biz more in a few weeks, when the workload lightens a little here.

  5. Thanks -- I look forward to it! Hope you have the time -- this is very helpful for me.

  6. I don't get the MLE/Bayes dichotomy. There is no categorical difference between "Bayes" on the one hand and "MLE" on the other. Depending on your loss function, using the MAP _is_ the Bayes optimal response. And depending on your priors, the MLE does coincide with the MAP. Identifying the MLE, then, is a variety of performing "Bayesian" MAP inference. (Yes, you can justify the MLE in entirely "unbayesian" terms, but so what?)

    Now I agree that that's not what people usually seem to mean when they talk about Bayesian inference, although in many cases, it is what they actually are doing - at the end of the day, the posterior is collapsed down to the MAP; and it's not exactly hard to pick your priors such that the MAP just is the MLE (although in the cases I've seen, that's usually not a good idea). However, I fail to see any argument that establishes some particular shortcomings of Bayesian modeling, in particular as opposed to "MLE".
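
    A minimal sketch of the point about priors, using a coin-flip (Bernoulli) model with a Beta prior as an assumed toy case; the function names `mle` and `map_estimate` are hypothetical. Under a uniform Beta(1, 1) prior the posterior mode (MAP) is the same number as the MLE, while an informative prior pulls the two apart.

```python
def mle(heads, n):
    """Maximum likelihood estimate of the heads probability."""
    return heads / n


def map_estimate(heads, n, a=1.0, b=1.0):
    """Posterior mode under a Beta(a, b) prior.

    The posterior is Beta(heads + a, n - heads + b); this mode formula
    assumes both posterior parameters exceed 1.
    """
    return (heads + a - 1) / (n + a + b - 2)


if __name__ == "__main__":
    heads, n = 7, 10
    print(mle(heads, n))                     # plain MLE
    print(map_estimate(heads, n))            # uniform prior: coincides with the MLE
    print(map_estimate(heads, n, 2.0, 2.0))  # informative prior: shrinks toward 0.5
```

    This only illustrates that the two estimates coincide under a flat prior in one simple model; it does not settle whether the extra Bayesian machinery earns its keep in the language cases discussed above.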

    Granted, there's a lot of "overselling", and parts of the criticism I sympathize with. But I see a danger of throwing out a very useful tool because of disagreements with particular (possibly overblown) interpretations of aspects of it. The analogy would be a rejection of any kind of formal grammar because one doesn't think that it could possibly be "psychologically real". Which, incidentally, is a (rather unfortunate) attitude some people do take, and it doesn't exactly seem to lead to the best kind of work...

    Replies
    1. @BB:
      At least in my piece I did not intend to impugn the tool. In fact, what I wanted to argue was that it was one tool among others, rather than an especially privileged one. There are parts of the cogsciesphere where being Bayesian just is being a cog scientist. Nothing else will do. And there are arguments made to the same effect. This is what I think needs some more justification, and you seem to agree given your last paragraph.

      As regards RCB and Yang, I think that they would be happy were Bayes taken as one tool among many, the problem then being to justify which tool should be used. However, as they are not mutes, nor shy, I leave this for them.

  7. Trying to answer my own question far above about what the original, Chomskian, evaluation metric is, I suggest that it is a technique for formulating nonconstructive hypotheses about what the (Marr) computational problem is, namely, finding the 'best' grammar given some specific assumptions about what the levels of representations and rules and/or parameters are. So its Bayesian/MDL version, with the needed fix for the absence of explicit negative evidence, would be the same thing (and probably false in detail, because the actual grammar is whatever is produced by the actual learning method, but I think it's still a useful idea for people who are trying to work out how to write grammars).
