Monday, April 11, 2016

Pouring gasoline on the flames: Yang on Bayes 3

I want to pour some oil on the flames. Which flames? The ones that I had hoped that my two recent posts on Yang’s critique of Bayes (here and here) would engender. There has been some mild pushback (from Ewan, Tal, Alex and Avery). But the comments section has been pretty quiet. I want to restate what I take to be the heart of the critique because, if correct, it is very important. If correct, it suggests that there is nothing worth salvaging from the Bayes “revolution” for there is no there there. Let me repeat this. If Yang is right, then Bayes is a dead end with no redeeming scientific (as opposed to orthographic) value. This does not mean that specific Bayes proposals are worthless. They may not be. What it means is that Bayes per se not only adds nothing to the discussion, but that taking its tenets to heart will mislead inquiry. How so? It endorses the wrong idealization of how stats are relevant to cognition. And misidealizations are as big a mistake as one can make, scientifically speaking. Here’s the bare bones of the argument.

1.     Everyone agrees that data matters for hypothesis choice
2.     Everyone agrees that stats matter in making this choice
3.     Bayes makes 4 specific claims about how stats matter for hypothesis choice:
a.     The hypothesis space is cast very wide. In the limit, all possible hypotheses are in the space of options
b.     All potentially relevant data is considered, i.e. any data that could decide between competing hypotheses is used to adjudicate among the hypotheses in the space
c.     All hypotheses are evaluated wrt all of the data. So, as data is considered, every hypothesis's chance of being true is evaluated wrt every data point considered
d.     When all data has been considered the rule is to choose that hypothesis in the space with the highest score
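To give (3a-d) concrete shape, here is a toy sketch of the recipe (my own illustration, not any published model; the "grammars", priors and likelihoods are invented):

```python
# Toy sketch of the Bayesian recipe in (3a-d). The "grammars" and their
# likelihoods are invented for illustration only.

def bayes_choose(hypotheses, priors, likelihood, data):
    """Score every hypothesis against every data point (3b-c),
    then pick the highest-scoring one (3d)."""
    scores = dict(zip(hypotheses, priors))
    for d in data:                      # all the data is used (3b)
        for h in hypotheses:            # every hypothesis is updated (3c)
            scores[h] *= likelihood(d, h)
    total = sum(scores.values())
    posterior = {h: s / total for h, s in scores.items()}
    return max(posterior, key=posterior.get), posterior   # choose the best (3d)

# A wide hypothesis space, here standing in (very minimally) for (3a)
grammars = ["G1", "G2", "G3"]
priors = [1/3, 1/3, 1/3]
lik = {"G1": {"a": 0.8, "b": 0.2},
       "G2": {"a": 0.5, "b": 0.5},
       "G3": {"a": 0.2, "b": 0.8}}
best, post = bayes_choose(grammars, priors, lambda d, g: lik[g][d], ["a", "a", "b"])
```

On this invented data, every hypothesis is scored against every data point, per (3b-c), and the single highest-scoring grammar is chosen, per (3d).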

Two things are worth noting about the above.

First, that (3) provides serious content to a Bayesian theory, unlike (1) and (2). The latter are trivial in that nobody has ever thought otherwise. Nobody. Ever. So if this is the point of Bayes, then this ain’t no revolution!

Second, (3) has serious normative motivation. It is a good analysis of what kind of inference an inference to the best explanation might be. Normatively, an explanation is best if it is better than all other possible explanations and accounts for all of the possibly relevant data. Ideally, this implies evaluating all alternatives wrt all the data and choosing the best. This gives us (3a-d). Cognitive Bayes (CB) is the hypothesis that normative Bayes (NB) is a reasonable idealization of what people actually do when they learn/acquire something. And we should appreciate that this could be the case. Let's consider how for a moment.

The idealization would make sense for the following kind of case (let's restrict ourselves to language). Say that the hypothesis space of potential Gs was quite big. For concreteness, say that we were always considering about 50 different candidate Gs. This is not all possible Gs, but 50 is a pretty big number computationally speaking. So say 50 or more alternatives is the norm. Then Bayes (3a) would function a lot like the standard linguistic assumption that the set of well-formed syntactic objects in a given language is effectively infinite. Let me unpack the analogy.

This infinity assumption need not be accurate to be a good idealization. Say it turns out that the number of well-formed sentences a native speaker of English is competent wrt is "only" 10^1000. Wouldn't this invalidate the infinity assumption? No, it would show that it is false, but not that it is a bad idealization. Why? Because an idealization is a good one insofar as it focuses attention onto the right problem. Which one? The Projection Problem: how do native speakers go from a part of the language to all of it? How, given exposure to only a subset of the language, does a LAD get mastery over a whole language? The answer: you acquire recursive rules, a G, that's how. And this is true whether or not the "language" is infinite or just very big. The problem, going from a subset to its containing superset, will transit via a specification of rules whether or not the set is actually infinite. All the infinity idealization does is concentrate the mind on the projection problem by making the tempting alternative idea (learning by listing) silly. This is what Chomsky means when he says in Current Issues: "once we have mastered a language, the class of sentences with which we can operate fluently and without difficulty or hesitation is so vast that for all practical purposes (and, obviously, for all theoretical purposes), we may regard it as infinite" (7, my emphasis NH). See: the idealization is reasonable because it does not materially change the problem to be solved (i.e. how to go from the part of the language you are exposed to, to the whole language that you have mastery over).

A similar claim could be true of Bayes. Yes, the domain of Gs a LAD considers is in fact big. Maybe not thousands or millions of alternatives, but big enough to be worth idealizing to a big hypothesis space in the same way that it is worth assuming that the class of sentences a native speaker is competent wrt is infinite. Is this so? Probably not. Why not? Because even moderately large hypothesis spaces (say with over 5 competing alternatives) turn out to be very hard to manage. So the standard practice is to use really truncated spaces, really small SWSs. But when you so radically truncate the space, there is no reason to think that the inductive problem remains the same. Just think if the number of sentences we actually knew was about 5 (roughly what happens in animal communication systems). Would the necessity of rules really be obvious? Might we not reject the idealization Chomsky argues for (and note that I emphasize 'argue')? So, rejecting (3a) means rejecting part of the Bayes idealization.

What of the other parts, (3b-d)? Well, as I noted in my posts, Charles argues that each and every one is wrong in such a way as to be not worth making: it gets the shape of the problem wrong. He may be right. He may be wrong (not really, IMO), but he makes an argument. And if he is right, then what's at stake is the utility of NB as a useful idealization for cognitive purposes. And, if you accept this, we are left with (1-2), which is methodological pablum.

I noted one other thing: the normative idealization above was once considered as a cognitive option within linguistics. It was known as the child-as-little-linguist theory. And it had exactly the same problems that Bayes has. It suggests that what kids do is what linguists do. But it is not the same thing at all. And realizing this helped focus attention on the problem the LAD actually faces. Bayes is not unique in misidealizing a problem.

Three more points and I end today’s diatribe.

First, one can pick and choose among the four features above. In other words, there is no law saying that one must choose the various assumptions as a package. One can adopt a SWS assumption (rejecting 3a) while adopting a panoramic view of the updating function (assuming that every hypothesis in the space is updated wrt every new data point) and rejecting choice optimization (3d). In other words, mixing and matching is fine and worth exploring. But what gives Bayes content, and makes it more than one of many bookkeeping notations, is the idealization implicit in CB as NB.

Second, what makes Bayes scientifically interesting is the idealization implicit in it. I mention this because, as Tal notes in a comment (here), it seems that current Bayesians are promoting their views as just a "set of modeling practices." The 'just' is mine, but this seems to me what Tal is indicating about the paper he links to. But the "just" matters. Modeling practices are scientifically interesting to the degree that they embody ideas about the problem being modeled. The good ones are the ones that embody a good idealization. So, either these practices are based on substantive assumptions or they are "mere" practices. If the latter, then Bayes modeling is in itself of zero scientific interest. Does anyone really want to defend Bayes in this way? I confess that if this is the intent then there is nothing much to argue about, given how modest (how really modest, how really really modest) the Bayes claim is.

Last, there is a tendency to insulate one's work from criticism. One way of doing this is to refuse to defend the idealizations implicit in one's technology. But technology is never innocent. It always embodies assumptions about the way the world is: assumptions under which the technology is a good one because it allows one to see/do things that other technologies do not permit or, at least, does not distort how the basic problems of interest are to be investigated. But researchers hate having to defend their technology, more often favoring the view that how it runs is its own defense. I have been arguing that this is incorrect. It does matter. So, if it turns out that Bayesians are now urging us to use the technology but are backing away from the idealizations implicit in it, that is good to know. This was not how it was initially sold. It was sold as a good way of developing level 1 cognitive theories. But if Bayes has no content then this is false. It cannot be the source of level 1 theories, for on the revised version of Bayes as a "set of modeling practices" Bayes per se has no content, so Bayes is not and cannot be a level 1 theory of anything. It is vacuous. Good to know. I would be happy if this were now widely conceded by our most eminent Bayesians. If this is now the current view of things, then there is nothing to argue about. If only Bayes had told us this sooner.


  1. I'll just reiterate and elaborate on the point I made in the previous thread that in order for a model to be Bayesian, it just needs to use Bayes' rule. Nothing about Bayes' rule requires (a) a large hypothesis space or (b) use of all the relevant data. Bayes' rule can be applied to a small hypothesis space, and it can be applied incrementally.

    Given a hypothesis space and some data, it will, per (c), be applied to all the available hypotheses. And, per (d), the decision rule is typically to pick the hypothesis with the highest posterior probability.

    Of course, it is correct to point out both that (a) and (b) are typically employed along with (c) and (d) in Bayesian models of cognition, and that these are substantive claims about cognition. But neither is necessary or sufficient for a model to be Bayesian.
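    To make this concrete, here is a minimal sketch (the two hypotheses and their likelihoods are made up): Bayes' rule applied incrementally, one datum at a time, to a small hypothesis space.

```python
# Minimal sketch: Bayes' rule needs neither a large hypothesis space
# nor batch use of all the data. Hypotheses and likelihoods are invented.

def update(prior, likelihoods):
    """One application of Bayes' rule: posterior is proportional
    to prior times likelihood, renormalised."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

belief = [0.5, 0.5]                    # tiny hypothesis space: just H1, H2
lik = {"x": [0.9, 0.3], "y": [0.1, 0.7]}   # P(datum | H1), P(datum | H2)
for datum in ["x", "x", "y"]:          # data arrive one at a time
    belief = update(belief, lik[datum])
```

    Each pass through the loop is one incremental application of Bayes' rule; nothing in the rule itself forces the space to be large or the data to be considered all at once.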

    1. Excellent. So c,d for you. Then you should find CY interesting as it argues that both are wrong in the sense that not all hypotheses are updated and the decision rule does not choose the best. But thx, that is useful. And it has content in the sense that we can argue against it.

    2. I read the paper last night, and I did find it interesting. For what it's worth, I'm a proponent of Bayesian statistical modeling, but not of Bayesian cognitive modeling. I have always found the Marr-based justifications for it unsatisfying, and I particularly liked what Yang had to say about that.

  2. All this stuff about the size of the hypothesis space and how update works seems like cruft hanging on the theoretical distinction of interest, which is why I think people in the last post were balking at (3). The tenets in (3) are correlated with that distinction, but they don't constitute it.

    For a (subjective) Bayesian, probabilities are (representative of) beliefs. This gets ported over to cognitive theorizing in the form of the following claim: probabilities are first class representational objects that learners manipulate over the course of learning. For the postulation of these objects to have any predictive power (at the level of the cognitive theory), they need to have consequences for, e.g., how learners carry forward uncertainty and ultimately select a hypothesis.

    One way learners might carry forward uncertainty is by (i) tagging grammars with probabilities, subject to (a) some algebraic structure the space of grammars has and (b) some way of measuring that structure that satisfies certain rules (e.g., the Kolmogorov or de Finetti axioms) and then (ii) updating those probabilities as data come in, again subject to those axioms and possibly some auxiliary assumptions. They might also carry forward uncertainty in ways that technically violate those axioms but within some approximation bounds. (More on approximation below.)

    So, one way the picture above could be wrong is that learners don't carry forward uncertainty at all. For instance, they might just hop between states in the hypothesis space, guided by some algorithm that may or may not have theoretical guarantees wrt choosing the "optimal" grammar (whatever that means). And that algorithm may well exhibit properties in aggregate that do just what a learner that actually represented the probabilities would do (or it might not). But that doesn't mean that the learner represents and manipulates the probabilities in any meaningful sense; it's just a fact about the structure of the algorithm.

    I may be reading you and Yang wrong, Norbert, but for your particular argument, can't you allow back-off to the approximation claim (within reason)? Because what's important is the claim about probabilities as first class cognitive objects. And if someone with a level 1 Bayesian theory doesn't want to make *that* claim, I think I agree that there's no value added from the theory being Bayesian (whatever it means for a theory to be Bayesian in that case). But I disagree that it's problematic as a matter of methodology, since either way, you need tools for understanding how a particular algorithm relates to a particular algebraic structure and probability measure on that structure (presupposing the truism that "stats matter in making this choice"). Just like in syntactic theory, different notations may make it easier or harder to state particular generalizations, but in the end, one can do uninteresting work in any notation, since the notation isn't what's important.

    1. "For instance, they might just hop between states in the hypothesis space, guided by some algorithm that may or may not have theoretical guarantees wrt choosing the "optimal" grammar (whatever that means)."

      Say you have a collapsed Gibbs sampler. The implementation of this may not explicitly represent the probabilities. The probabilities are just the normalised ratios of certain states in the sampler. So it seems that a Marr level 1 model of this would be saying: this algorithm is approximating the appropriate integral, i.e. doing Bayesian inference.
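      Here is a minimal illustration of that idea (not a collapsed Gibbs sampler, just a simple Metropolis-style chain over an invented two-grammar space): the algorithm only ever hops between states, and the posterior shows up only as the normalised fraction of time spent in each state.

```python
# Not a collapsed Gibbs sampler, but a minimal Metropolis-style chain
# making the same point: the algorithm just hops between states, and the
# posterior is only implicit in the ratio of time spent in each state.
import random

random.seed(0)
weights = {"G1": 0.51, "G2": 0.49}   # unnormalised posterior (invented numbers)
state = "G1"
visits = {"G1": 0, "G2": 0}
for _ in range(100_000):
    proposal = "G2" if state == "G1" else "G1"
    # Accept the hop with probability min(1, weight ratio); no posterior
    # probability is ever stored or updated as such.
    if random.random() < min(1.0, weights[proposal] / weights[state]):
        state = proposal
    visits[state] += 1
freq_G1 = visits["G1"] / 100_000
```

      Over many steps the chain visits G1 roughly 51% of the time, even though no probability is explicitly represented anywhere in the algorithm's state.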

    2. Just to clarify Aaron, you are suggesting that the critical commitment of a 'cognitive Bayesian' is that humans represent probabilities over two or greater hypotheses? And were you going on to say that even if humans don't maintain representations that work exactly like real probabilities, a fairer reading of the cognitive Bayesian's core commitment would just be that humans maintain *any* kind of information about the uncertainty of more than one hypothesis?

    3. @Ellen: No. I agree that there are further requirements on how beliefs are manipulated for a Bayesian---namely, that they're subject to whatever the probability axioms are. When I say "violation" (within reason), I mean, e.g., making independence assumptions where they aren't warranted, and other things like that which can make inference easier.

      One could presumably construct an alternative system that was similar in assuming most or all of Norbert's (3) and that represented beliefs as first class objects---but not as probabilities. Nothing about (3) is explicitly probabilistic, which suggests to me that what Norbert is targeting is more general. For instance, gradient symbolic computation might fit under (3). In that case, the learner wouldn't be Bayesian but it could carry forward uncertainty (whatever the analogue of uncertainty is there).

      But in the end, the reason I think it's useful to state the distinction in this way is not to clarify the cognitive Bayesian position, which I think gets muddier, but to clarify the Norbert/Yang position, which becomes sharper. I'm trying to figure out if the position they're pushing is a sort of cognitive frequentism, wherein probabilities (or tolerance thresholds or whatever) may be first order "beliefs" (about the cognitive objects of interest) but never second order beliefs (about other beliefs about the cognitive objects of interest). ("Beliefs" in scare quotes because, for a frequentist, probabilities are never beliefs, though for a cognitive frequentist, it might be useful to think of them that way.)

    4. @Alex: if your point is that a particular Bayesian model might be a useful level 1 description of what some algorithm is computing, I agree---not least because as our models expand to explain different aspects of cognition (even just within the language faculty), we need principled ways of hooking those models together. (I think, though, that to make good on this approach, one needs to state constraints beyond those laid down by probability theory to give the learning theory content---e.g., some grammar of graphical models beyond that imposed by variable typing.)

      I take Norbert's question to be specifically how transparently level 1 descriptions map onto level 2 learning algorithms. So, do learners really track uncertainty as a separate second order belief (possibly subject to constraints on what constitutes a belief), or do they use some algorithm that has the same result at the level of the population but doesn't explicitly track probabilities?

      So for instance, in Bonawitz et al. (2011) "A Simple Sequential Algorithm for Approximating Bayesian Inference", the authors compare win-stay lose-shift (WSLS) with random sampling from a posterior for some model and show that "the marginal distribution over hypotheses after observing data will always be the same for [WSLS] as for sampling from [that] posterior". They furthermore show that these algorithms make different predictions at the level of the individual and that (one version of) WSLS fits the experimental data better than random sampling. This, which is of a piece with the Medina et al. and Trueswell et al. results, might be a point in favor of learners not tracking second order beliefs. (Maybe.)

    5. So, the discussion, surprise surprise, has gotten over my head very quickly. I hope Charles is looking at this and maybe he can jump in. But, let me try to say something that might relate to what you are getting at.

      CY proposes a principle (the principle of sufficiency) for generalizing from instances to rules. This principle is computationally tractable and can be applied. It is proposed as a description of a psychological mechanism that is operative in the LAD and explains not only end states but the dynamics we see kids going through. So, there is good evidence that this is a principle describing the acquisition system.

      Now, is THIS principle Bayesian? CY argues not, and I agree. You can see CY's discussion around p23 for the reasoning. At any rate, say that CY is correct about the principle. If it is, one can ask another question: WHY do we have this principle? This is closely related to the kinds of minimalist questions we typically ask: why this UG and not another? Here one can imagine a Bayes answer: we have this CY principle BECAUSE it optimizes acquisition under the conditions in which acquisition applies. What are those conditions? Well, that's what needs to be supplied. So, one can imagine a second order Bayes explanation for the first order non-Bayes rule. One can imagine this, and evaluate the proposals once made.

      There is, of course, a problem with these kinds of explanations, one which minimalists are all too familiar with. It's sometimes a bit too easy to find "constraints" that make one's favorite principle optimal. This is also the bane of cost-benefit analyses in general. So, we want some reasonable empirical argument demonstrating that the constraints in fact hold. Note that this kind of account aims for an account of how the principle arose. We have left the domain of cognition and entered the domain of evo speculation. That's of course fine with me. And maybe this speculation will lead to changes in what we take the first order principle to be. But, if CY is right, then it will have to derive the CY first order principle (at least more or less). Like I say, this would seem to me to be a coherent research program. To my reading (slight as it is) this is not what we see.

      What do we see? We see Bayes proposed as a level 1 theory aimed at describing the computational problem that needs solving. In one sense the problem is easy to describe: how do you use data to settle on a hypothesis? Bayes, as I understand it, makes the following idealization: it is a species of rational induction. In other words, it has the properties of NB. This is a nice concrete proposal. CY argues that this concrete proposal is wrong in every way imaginable. The fallback is not that Bayes is a level 1 theory answering how we use data to settle on hypotheses but a theory of why the principles we in fact use to settle on hypotheses are the ones we have. This theory answers a different question. One moral of CY might be: get your questions straight.

      Now, all this blathering might be beside the point. I hope not. But if so, sorry.

  3. I'm still not convinced that there's any particular cognitive theory that is committed to 3a-d. I'm especially surprised by the last point (When all data has been considered the rule is to choose that hypothesis in the space with the highest score) because it seems to me that the "most Bayesian" thing to do is to never choose any particular hypothesis, and rather marginalize over all of the hypotheses based on their posterior probability. It might be more productive to discuss one specific published proposal.

    1. A Bayesian classifier chooses the option with the highest posterior probability. You can modify the choice rule by taking into account differential payoffs, but the basic idea is the same. As far as I know, this is necessary for any claims about optimality.

      I've been frustrated by the way optimality is discussed in some papers proposing Bayesian cognitive models, largely because it's not clear what is meant by "optimal" in many cases. Or, more to the point, it's not clear that the way in which Bayesian classifiers are provably optimal is what they're talking about. Sometimes it seems like they're talking about something more like statistical optimization, i.e., finding the maximum (or minimum) of a function, which is related, but not the same thing.

    2. I'm not sure what you mean - in what way is language acquisition a classification problem?

    3. Sorry I wasn't clear. I was trying to make the point that the choose-the-largest rule is directly related to optimality. So, to the extent that optimality is what someone is after, Norbert's (d) is important.

      I don't know enough about the specifics of the Bayesian models that Yang is criticizing to say much about them with any authority, but I don't think it would be particularly useful to treat them as classifiers. To the extent that they are not classifiers, this poses potential problems for claims that they are optimal in any interesting sense.

    4. I think Tal's point is that the optimal thing to do is often not to pick a single hypothesis but to sum (integrate) over all hypotheses weighting them by their posterior probability. For me this is what "real Bayesians" do.

      So one (very narrow) way of viewing language acquisition mathematically is as predicting

      Pr(next sentence | all sentences seen so far).

      Picking G = maximum a posteriori grammar given the data and then using

      Pr(next sentence | G)
      is not optimal.

      It is better to use

      sum over all G ( Pr (G | data) Pr(next sentence | G))

      where the sum will probably be integration.

      (I am sure you know this Noah, just for clarity for others).

    5. Thanks, Alex - that was indeed my point. Norbert's principle 3d, Noah's argument, and Charles Yang's argument that Bayesian models can't accommodate variation and language change (section 3.1 in the paper), all seem to be arguments against maximum a posteriori point estimates rather than "true" Bayesian models that consider the full posterior.

    6. I have no idea what you two mean by a "true" Bayesian, but let me just note that your use of the term seems at odds with what the hardcore stats community takes it to be. So, for example, Shalizi (here) discusses a paper by Eberhardt and Danks (E&D) that discusses your proposed Bayesian decision rule. You can call your rule Bayesian if you want, but it is quite different from how Bayes is understood by THESE pros. Far be it from me to argue over an honorific term. The substance is that once one moves away from the normative underpinnings of Bayes, what you are left with is technology. So far as intellectual matters go, there is no there there. As E&D put it: "The alternative strategy of weakening the underlying notion of rationality no longer distinguishes the Bayesian model uniquely. A new account of rationality — either for inference or for decision-making — is required to successfully confirm Bayesian models in cognitive science." Quite right, remove the optimization part of the decision rule and you are no longer playing the Bayes game. The game might be worth playing, but it is a different game, one where the Bayes part is there purely for PR reasons.

    7. @Norbert: Here is David MacKay responding to a similar question:

      > From: Herman Bruyninckx
      > Subject: Non-invariance of MAP...
      > In Bayesian methods for neural networks - FAQ, you make the
      > remark that ``Some people (including many Bayesians!) have the impression
      > that Bayesian inference is all about finding MAP parameters (maximum a
      > posteriori) but this is a mistaken view.'' This answer leaves me burning
      > with curiosity about what your answer is to the following question: what
      > then *is* Bayesian inference all about? I mean, what decision processes are
      > `allowed' (in the sense of being invariant)?

      David J.C. MacKay responds:
      "Decision theory very rarely involves MAP. The optimal decision is the one that minimizes the expected cost. The expected cost is found by MARGINALIZING over the posterior distribution. Only in toy problems will you find that the decision from marginalizing is the same as MAP."


    8. Karthik, could you work through an example for me so that I see this. Certainly in the parade cases, probability matching fails to maximize the return (or minimize the cost). You know the cases with rats and food. So, what cost structures lead to non-MAP?

    9. @Norbert: I was just countering the "hardcore stats" view that you presented. MacKay is as hardcore as anyone can imagine when it comes to Bayesian thinking, and he clearly states that MAP is not reasonable or optimal in more realistic settings according to him.

      As far as examples go, I am not sure what exactly you are asking for. Perhaps, if you clarify, either I or others around might try.

      P.S. - I am not defending/arguing against the bayesian position. That's a tangential issue for this particular comment thread, which I see more as establishing what the mainstream bayesian position is.

    10. Other things people have discussed above in the comment thread that are tricky:
      a) Use of Bayes' theorem implies Bayesian models. It doesn't, according to statisticians (Wasserman is a very respectable statistician, AFAIK). A standard frequentist will also use Bayes' theorem where appropriate, according to them. The disagreement is about where it is appropriate.

      b) Bayesianism says probabilities are beliefs. Again, not necessary, unless you are a subjective Bayesian. The claim totally ignores Objective Bayesians, and there are many respectable statisticians who practise Objective Bayes.

      The big problem given such a disparate field is that arguing against "bayesian models" is likely not possible. It might be more fruitful to say that one is arguing against a certain set of bayesian models with particular claims attached (as I see it, Charles seems to be doing this - though I need to read the paper much more carefully).

      So, it is not that Charles's criticisms are not well-placed - it is that they are not likely to convince all Bayesians given such a large variety of stances within.

    11. Andrew Gelman, whose Bayesian credentials are unassailable, is no fan of model averaging. You can search for the term on his blog to find some commentary, and you can note that the term gets mentioned all of twice (one instance of which is fairly dismissive) in the third edition of Gelman et al's Bayesian Data Analysis. The point being that, arguments about what a True Scotsman, er, Bayesian would do aside, I don't think model averaging is a necessary (or sufficient) component of Bayesian modeling.

      It's probably also worth noting that averaging across all possible grammars does nothing to help with the well-known computational problems with Bayesian cognitive models. It seems to me that it would probably make it worse, since it's an additional (costly?) computation over a (very) large hypothesis space.

      Finally, with respect to optimality, in what sense would the integration/sum across all grammars be optimal? Just with respect to getting the best estimate of Pr(next sentence)? Could this then be used to pick a grammar? Or is picking the right/best grammar not a universally agreed upon goal? This is what the non-Bayesian GG folks assume is the point, right? I was under the impression that, e.g., Griffiths, Tenenbaum, et al, shared this assumption.

    12. @karthik: Regarding (b), you'll note that this fact was mentioned explicitly: "For a (subjective) Bayesian, probabilities are (representative of) beliefs." If the question is what kinds of cognitive representations the modeler is committed to, which it presumably is, what is the relevant distinction between a subjective and an objective Bayesian? I'm not asking what the difference is as a matter of philosophy; I'm asking why you bring the distinction up in a discussion of cognitive models. I don't know of any modelers that rest on a cognitive analogue of this distinction as a way of generating a hypothesis about cognition.

    13. On the optimality issue, the MAP procedure will often give completely wrong answers. Here is a toy example.

      Suppose we have two probabilistic grammars G1 which generates A with prob 0.51 and B with prob 0.49, and G2 which generates A with prob 0.49 and C with prob 0.51.
      And we have an even prior.

      The learner sees an A. On the MAP approach you reason that the MAP estimate is G1 and so the prob of seeing a C next is 0. But this is just wrong. The probability of seeing a C given that the first is A is in fact about 0.25.
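      The toy numbers can be checked directly (a sketch, with the probabilities exactly as stated above):

```python
# Checking the toy example: two grammars, even prior, learner sees an A.
p_A = {"G1": 0.51, "G2": 0.49}           # P(A | G)
p_C = {"G1": 0.0,  "G2": 0.51}           # P(C | G)
prior = {"G1": 0.5, "G2": 0.5}

# Posterior after observing A
unnorm = {g: prior[g] * p_A[g] for g in prior}
z = sum(unnorm.values())
post = {g: u / z for g, u in unnorm.items()}          # {G1: 0.51, G2: 0.49}

# MAP: commit to the single best grammar, then predict
map_pred = p_C[max(post, key=post.get)]               # G1 wins, so P(C) = 0

# Marginalising: weight each grammar's prediction by its posterior
marginal_pred = sum(post[g] * p_C[g] for g in post)   # 0.49 * 0.51, about 0.25
```

      MAP commits to G1 and so predicts C with probability 0, while marginalising over both grammars gives 0.49 x 0.51, about 0.25, as stated.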

      Gelman's concerns are in Bayesian statistics rather than Bayesian learning, so perhaps that explains his dismissive comments? But I agree that Bayesianism is a broad church. However, I think Norbert's (3a-d) characterise pretty much all probabilistic learning, including non-Bayesians, and one of the ways in which Bayesians differ from the norm is through rejecting (3d).

    14. Not to get too far into the weeds on this, but I'm going to get too far into the weeds on this. I just find it interesting, even if I'm veering increasingly off topic.

      It seems to me that the Pr(C|A) is only "in fact about 0.25" if you have a model that averages over G1 and G2. This may well be (or not be) a good model of grammatical knowledge. Or the fact that we've observed A and C could mean that neither G1 nor G2 are good grammars, so maybe it's just that our hypothesis space is poorly constructed.

      Anyway, like I said, increasingly off-topic, so maybe not worth pursuing any more...

  4. I think the distinction between normative and cognitive models is very important. I deal in pragmatics, where there is room for both types of models. Most "pure" models of neo-Gricean reasoning (like game-theoretic models) are normative in the sense that we expect people in the real world to reason imperfectly and deviate from what the model predicts is optimal. For example, implicatures may be predicted by a normative model which are in reality difficult to get, because people bring their own biased reasoning mechanisms to the table, and these often undercut Gricean reasoning. But nonetheless, there is room for normative models in pragmatics, because people do tend to follow conversational conventions that either match or approximate what is normatively optimal. For those cases, Bayesian models are appropriate.

    But I have been somewhat critical (here) of the application of Bayesian cognitive modeling to pragmatic questions, though perhaps I'm more critical of its application in practice than in theory. That is to say, maybe it's possible to do it right, but it seems to me that often much of what's interesting gets dumped unceremoniously into the prior without further investigation. This creates the illusion that you've explained something when really you've just described it.

    But this view---that Bayes is useful for normative models and not so useful for cognitive models---spells trouble for the application of Bayesianism to the rest of linguistics, as Charles does a good job pointing out. Thinking especially of the problem of syntax and language learning: What would a "normatively optimal grammar" look like? It's at worst an incoherent notion, and at best irrelevant to actually doing linguistics. The only way the learner can be seen to be "optimizing" her grammar is if a whole host of complex constraints are built into the prior in order to capture the fact that (i) the learner doesn't always use all of the data available to her, and (ii) the learner often makes inferences far beyond what the data has justified (i.e. poverty of stimulus). And if you were to do that, you'd want to investigate where these constraints on the prior come from (otherwise you haven't explained much). And that investigation, to me, is exactly what generative linguistics is in the first place. Is it possible to capture island constraints with Bayes' rule? Surely it is... but what's the use?

  5. Backing off a bit from Bayesian technicalities, one important point in Charles' paper is the demonstration that there is a quantitative but not statistical (at least, not immediately statistical) factor in language acquisition, the Tolerance Principle.

    On the other hand, the category of a-adjectives does seem like a special sub-part-of-speech of English, and I'm not getting a clear idea of what drives its acquisition, so is Bayes really dead here? Note for example that the a-adjectives have various '(often partially) functionally adequate substitutes' (I wanted to call them paronyms or paranyms, but both of these terms are already taken). So we can say 'a burning house' instead of 'an aflame house', or 'a noisy/babbling/crying/complaining/lively/... child' instead of 'an awake child' (when we wish to point out that a child is awake, it is usually because it is satisfying one or more of the above descriptors). So perhaps a grammar that recognizes this class and imposes the restriction is overall Bayesianly better than one that doesn't (though focussing on successful prediction of the next word, as it is my impression that Michael Ramscar and his group do, might be another way of thinking about it; I don't know whether the two amount to the same thing in the end).

    And even if, technically, there is no there there for Bayes, it is, I think, still a reasonable basis for the thought that if you are an ordinary generative grammarian trying to 'tune' the formalism so that it is easier to describe stuff that does happen a lot, and harder or impossible to describe stuff that doesn't happen much or at all, your activities might well contribute to an explanation of why languages are learnable, and how.