Monday, August 5, 2013

My Problems with Reverend Bayes


The title of this post is doubly misleading. First, I have no problem with the Reverend. He’s never treated me badly, most likely because he’s been dead for some time. It’s the modern day application of his rule that disturbs me. Second, my “problem” is more a discomfiture than a full-blown ache. I don’t know enough (though I wish I did, really) to be distressed. However, uncomfortable I am (say this as Yoda would), and have decided to post about this unease in the hope that a kindly Bayesian will take pity on me and put my worries to rest. So, what follows tries to articulate the queasy feelings I have about the Bayesian attitude towards cognitive problems, especially in the domain of language, and I put them here, not because I am confident in the following remarks, but because I want to see if the way I am thinking about these matters makes any sense.[1]  So let’s start.

First, these divagations have been prompted by a nice comment by Mark Johnson, a reply by Charles Yang, and a comment by Aaron White here. The comments led me to think about my worry in the following way. Be warned, it takes a little time to get there.

As you know, I am obsessed with the contrast between rationalist (R) and empiricist (E) conceptions of mind. This contrast has often, IMO, been misunderstood as a contrast between nativist and non-nativist conceptions. However, this cannot be correct, for any theory of acquisition, even the most empiricist one, requires natively given structures to operate, i.e. every theory of “learning” needs a given (i.e. native) hypothesis space against which data provided by the environment are evaluated. So a better way of marking the R/E contrast is in how articulated the hypothesis space in any given domain is. Es take as their 0th assumption that the space is pretty wide and fairly flat and that learning (i.e. the sophisticated use of environmental information) guides navigation through the space. Rs take as their 0th assumption that cognitive spaces are pretty narrow and highly structured and that environmental influences, though not unimportant, play a secondary role in explaining why/how acquirers get from where they start to where they end up. If there are lots of roads from A to B then finding the best one can be a very complicated task. If there are just one or two, choosing correctly is not nearly as complicated. This I take to be the main R/E difference: both concede the importance of native structure and both leave a role for environmental input. The difference lies in the relative weight each assigns to these different factors. Es bet that the action lies with good ways of evaluating the environmentally provided information. Rs lay their money on finding the narrow set of articulated options. With this as background, here’s my problem with Reverend Bayes' descendants.[2]

I believe that Bayesian methods generally favor the first conception of the acquisition problem. I am pretty sure that this is not required, i.e. there is nothing in Bayesianism per se that requires this. However, one reason to use fancy counting methods (and they can be fancy indeed (think Dirichlet)) is the belief that how one counts is largely causally responsible for where one ends up. In other words, Bayesian methods are interesting to the degree that the set of options is wide. If so, the real trick is to figure out how to efficiently navigate this space in response to environmentally provided information. Thus, Bayesian affinities resonate harmoniously with E-like background assumptions. Consequently, if one believes, as I do, that a good deal (most?) of the interesting causal action lies with the constrained and articulated shape of the hypothesis space, then one will look on Bayesian predilections with some suspicion. Put more diplomatically, if there is a trade-off between how tight and structured the space of options is and how complex and sophisticated the learning procedure is, then Rs and Es will place their research bets in different places even if both features (i.e. the shape of the space and the nature of the learning theory) are agreed to be important.
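To make the contrast concrete (and to give the target practice below something definite to shoot at), here is a toy sketch I cobbled together. It is nobody's actual acquisition model: the candidate grammars, priors and "fit" numbers are all invented. The only point is that the same update rule does nearly all of the work when the prior is wide and flat, and very little of it when the prior is narrow and structured.

def update(prior, likelihood, datum):
    # Bayes' rule: posterior(h) is proportional to prior(h) * likelihood(datum given h)
    unnormalized = {h: p * likelihood(h, datum) for h, p in prior.items()}
    total = sum(unnormalized.values())
    return {h: u / total for h, u in unnormalized.items()}

# Invented "fit" scores: how well each candidate grammar predicts the datum.
fit = {"hierarchical": 0.6, "right-branching": 0.3, "linear": 0.1,
       "templatic": 0.2, "mixed": 0.05}
likelihood = lambda h, d: fit[h]

# E-ish bet: many options, flat prior; the data must do the narrowing.
flat_prior = {h: 1.0 / len(fit) for h in fit}

# R-ish bet: few, pre-structured options; the prior has already done most of the narrowing.
narrow_prior = {"hierarchical": 0.9, "right-branching": 0.1}

print(update(flat_prior, likelihood, "a bit of PLD"))
print(update(narrow_prior, likelihood, "a bit of PLD"))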

If this is so, then the problem with Bayesianism from my Rish point of view is that it presupposes an answer to (and hence begs) the fundamental question of interest: how structured is the mind?

You can get a good taste of this from Perfors et al.’s paper (here). What it shows is that if one starts with three grammatical options, a linear grammar, a simple right branching grammar and a phrase structure grammar (PSG), then there is information in the linguistic input that would favor choosing PSGs and that an (ideal) Bayesian acquisition device could use this information to converge on this grammar. This conclusion gets used to argue that linguistic minds need not specify that the choice of hierarchical grammars by LADs (viz. PSGs) is “innate,” for a Bayesian learning mechanism suffices to reach this same conclusion without ruling out the other options nativistically. Putting aside whether anyone argued for what Perfors et al. argued against (see here for a very critical discussion), what is useful for present purposes is that Perfors et al. illustrate how Bayesians trade Bayesian learning for restrictive hypothesis spaces. Indeed, in my experience (limited though it is) I’ve noticed that one thing Bayesians never mind doing is throwing another option into the space of possibilities, secure in the knowledge that, in the limit, the Bayesian learner will get to the right one no matter how big the space is.

To repeat something I said earlier, it is possible that Bayesian methods of counting have advantages even in highly structured hypothesis spaces of the kind that syntacticians like me are predisposed to think exist in the domain of language. However, this is what needs showing, in my view. Moreover, and this lies behind some of my worry, one can view the SYTG paper discussed in a prior post (here) as arguing that in such a context such methods are not at all helpful. Indeed, they point in the wrong direction. Berwick gave similar arguments at the last LSA for morphological acquisition. In both kinds of cases, it seems that in the more restricted hypothesis domains that they consider, much simpler counting procedures do a whole lot better, and for principled reasons. So, even though there is nothing incompatible between the Reverend’s rule and structured minds, there is an affinity between Eism and Bayesianism that shows up both in theory (how the computational problem is posed) and practice (in concrete proposals for dealing with particular problems).

So, why do the Reverend’s acolytes bother me? Reason/feeling number 1: because their conception of acquisition is largely environmentally driven, and my Rish sensibilities, based as they are in what happens in the domain of language, lead me to think that this is wrong. And not just a little wrong, but deeply wrong. Wrong in conception, not wrong in execution. Or, to put this in Marr’s terms, Bayesians in their E-ishness have misconstrued the computational problem to be solved. It’s not how do we use environmental input to navigate a big flat space, but how do we use such data to make relatively simple structured choices.

The Marr segue above leads to a second (and largely secondary) source of my unease. Bayesians often describe their theories as Level 1 computational theories in Marr’s sense (see here in reply to here). Here’s Mark Johnson, for example, from the comment linked to above. I interpret “probabilistic” here as “Bayesian.”

A probabilistic model is what Marr called a "computational model"; it specifies the different kinds of information involved and how they interact.

There is a good reason why Bayesians endorse this view of their proposals; interpreted algorithmically these theories often seem to be computational disasters (e.g. here and here). Suffice it to say that there seems to be general agreement that Bayesian analyses do not lend themselves to easy, transparent algorithmization. Oddly, as Stabler notes here, in discussing the efficient parsability of MG definable languages, this is not the case for standard minimalist grammars:

In CKY and Earley algorithms, the operations of the grammar (em and im) [internal and external merge, NH] are realized quite directly [my emphasis, NH] by adding, roughly, only bookkeeping operations to avoid unnecessary steps (p.8).

Indeed, in my limited experience most generative parsing models, from the Marcus parser onward, have had relatively transparent relations between grammars and parsers, and this was considered a virtue of both (see here). Indeed, since grammars do get used, and seem to be used relatively effectively, it would be copacetic if there were a nice simple relation between competence grammars and parsing grammars (between grammars and algorithms that are used to parse incoming “language”). However, from what I can gather, this is something of a problem for Bayesians, as it seems that the move from a level 1 competence theory to a level 2 algorithmic account won’t be particularly simple or transparent, since the simple transparent ones seem to be a computational mess.
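To give a concrete sense of what “realized quite directly” means, here is a bare-bones CKY recognizer of my own for a toy context-free grammar (not Stabler's MG parser, and not anyone's psycholinguistic proposal). The only substantive operations are applications of the grammar's own rules; everything else is the bookkeeping Stabler mentions.

# Toy grammar in Chomsky normal form: (B, C) -> set of parents A such that A -> B C.
toy_rules = {
    ("NP", "VP"): {"S"},
    ("D", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
toy_lexicon = {"the": {"D"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}

def cky_recognize(words, rules, lexicon, start="S"):
    n = len(words)
    # chart[i][j] holds the categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        # a grammar rule, applied directly
                        chart[i][j] |= rules.get((left, right), set())
    return start in chart[0][n]

print(cky_recognize("the dog chased the cat".split(), toy_rules, toy_lexicon))  # True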

My earlier post on word acquisition (here) touches on this, noting that the procedures that SYTG ran to test the Bayesian story were a pain to run and that this is a general feature of Bayesian accounts. However, I am sure that the issue is very complex and that I have probably misunderstood matters (hence my setting down my worries to act, I hope, as useful targets).

Let me end on one thing that I like about Bayesian approaches. It seems that they have a good way of dealing with what Mark calls chicken and egg problems. Bayesians have ways of combining two difficult problems so that solving each becomes easier. This looks like it would very often be a useful thing to be able to do, and to the degree that Bayesians offer a compelling way of doing it, that is a point in their favor. A question: is this kind of solution to the chicken-egg problem limited to Bayesian kinds of analysis, or is it a property that simpler counting systems could encode as well? If it is a distinctive property of Bayesianism, that would seem to be a very nice feature.
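For what it is worth, here is the shape of that trick as I understand it, in a toy sketch of my own (the two “problems”, the priors and the fit numbers are all invented). Instead of settling problem A first and then problem B, the learner scores joint hypotheses (A, B), so each candidate answer to one problem gets evaluated against candidate answers to the other.

import itertools

def joint_posterior(prior_A, prior_B, likelihood, data):
    # P(A, B | data) is proportional to P(A) * P(B) * P(data | A, B)
    unnormalized = {(a, b): prior_A[a] * prior_B[b] * likelihood(a, b, data)
                    for a, b in itertools.product(prior_A, prior_B)}
    total = sum(unnormalized.values())
    return {ab: u / total for ab, u in unnormalized.items()}

# Invented chicken-and-egg pair: segmentations (A) and lexicons (B), each hard to
# evaluate without an answer to the other.
prior_A = {"seg1": 0.5, "seg2": 0.5}
prior_B = {"lex1": 0.5, "lex2": 0.5}
fit = {("seg1", "lex1"): 0.9, ("seg1", "lex2"): 0.1,
       ("seg2", "lex1"): 0.1, ("seg2", "lex2"): 0.3}
likelihood = lambda a, b, d: fit[(a, b)]

print(joint_posterior(prior_A, prior_B, likelihood, "toy data"))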

Let me abruptly end here and let the target practice (and enlightenment) begin. 



[1] This is one of the nicest things about blogs. In contrast to articles, where you need arguments that start from reasonable premises and go in a coherent direction, in a blog post it is possible to ruminate out loud and hope that with a little help from your “friends” a little more clarity might be forthcoming.
[2] It is a curious consequence of the Bayesian position that, at first blush, they postulate a whole lot more native givens than Rs typically do. I suspect that this is related to what Gallistel and King call the “infinitude of the possible.” The Bayesian way is to load the hypothesis space with LOTS of options and winnow them down using environmental information. This puts a lot into the space. If the space is given (i.e. innate) then this approach has the curious property of loading the mind with a lot of stuff, much more than Rs typically consider to be there. So, in a curious sense, Es of this stripe are far more “nativist” than Rs are.

21 comments:

  1. Target practice...

    (1) "I believe that Bayesian methods generally favor the [empiricist] conception of the acquisition problem. I am pretty sure that this is not required . . . If this is so, then the problem with Bayesianism from my Rish point of view is that it presupposes an answer to (and hence begs) the fundamental question of interest: how structured is the mind?"

    The problem is that you presuppose an answer to the fundamental question of interest: how empiricist is the Bayesian? This is a strawman - a common theme in replies to Bayesians - and is, moreover, false. As such, there isn't really much more to be said. See the replies to Jones and Love's BBS article (same strawman), which mainly said what I'm saying here, much more articulately, and many, many times.

    (2) "However, one reason to use fancy counting methods . . ."

    Probabilistic inference is not fancy counting. Probabilistic inference is not fancy counting. Probabilistic inference is not fancy counting! Probabilistic inference is (one form of) generalized deduction, and a consequence of using it is that it so happens that more evidence is better evidence in most circumstances. As a result, the math will shake out to be sensitive to how much data you have, which I guess is vaguely isomorphic to counting. But this is way too generous; I'm not sure how estimating the center of a cluster of data points in space by finding the mean of the observed data is really "counting" in any reasonable sense, even if it is sensitive to the number of data points.
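    To put the mean example in concrete terms (a tiny sketch of my own, with invented numbers): estimating the center of a cluster the Bayesian way means computing a precision-weighted average of the prior mean and the sample mean. Nothing gets tallied; the number of data points matters only in that it shifts the weight toward the sample mean.

    def posterior_mean(prior_mean, prior_var, observations, obs_var):
        # Conjugate normal-normal update with known observation variance obs_var:
        # the posterior mean is a precision-weighted average of prior and data.
        n = len(observations)
        sample_mean = sum(observations) / n
        w_prior = 1.0 / prior_var
        w_data = n / obs_var
        return (w_prior * prior_mean + w_data * sample_mean) / (w_prior + w_data)

    print(posterior_mean(prior_mean=0.0, prior_var=1.0,
                         observations=[2.1, 1.9, 2.2, 2.0], obs_var=0.5))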

  2. "Probabilistic inference is (one form of) generalized deduction, and a consequence of using it is that it so happens that more evidence is better evidence in most circumstance."

    I take it that this is precisely what's at issue, no? It seems that in many circumstances, e.g. SYTG's, this is false. So isn't the assumption that it is generally true begging a question? And if this is your view, won't the main problem be getting lots of evidence and figuring out how to weight it? Again, there is no reason to think that Bayesians need have weak and unstructured hypothesis spaces, but many do, and concentrate on figuring out how to massage the data to get where you want to go, as e.g. Perfors et al. do. I will, however, take a look at the BBS piece. Thx.

    Replies
    1. "Less is more" had a very Bayesian-friendly way of implementing this idea: working memory limits the ability of the learner to use all the data (in the "optimal" way). Lisa Pearl implemented this in inference and found it was true for segmentation. There are ways you can imagine similar pressures being embedded right in the prior, rather than in the search procedure, with UG helping you to use less of the data relevant to one part of the problem if it helps you solve another part. I don't doubt that less may be more, or at least, more like humans - but I also doubt that it's as simple as SYTG propose when we look at a bigger picture. After all, the whole point of "Pursuit with abandon," as you point out, was to soften an overly-committal learner. There are, of course, two ways to try and answer the question of "how exactly / under what circumstances is ignoring evidence crucial": one is bottom up (implement an algorithm) and the other is top down (implement an explanation, e.g. the Pearl stuff). Bayesian tools are not the only solid base around which we can do the latter, but they're certainly very handy, and I see no reason to abandon them yet.

  3. This comment has been removed by the author.

  4. "It is a curious consequence of the Bayesian position that, at first blush, they postulate a whole lot more native givens than Rs typically do. I suspect that this is related to what Gallistel and King call the “infinitude of the possible.” The Bayesian way is to load the hypothesis space with LOTS of options and winnow them down using environmental information. This puts a lot into the space. If the space is given (i.e. innate) then this approach has the curious property of loading the mind with a lot of stuff, much more than Rs typically consider to be there. So, in a curious sense, Es of this stripe are far more “nativist” than Rs are."


    This is an interesting point; suppose you have a prior P2 which only has two hypotheses H1, H2,
    and a prior P4 which has four, H1,H2,H3,H4.
    So from one point of view P2 has a "smaller" nativist component,
    since it has fewer grammars, but from another P4 is weaker because it has higher entropy and is less informative. Both of these intuitions are in my view valid, though the latter is more standard in Bayesian terms.

    It depends on whether you have an information theoretic view of complexity (Shannon) or an algorithmic view (Kolmogorov). Information theoretically we say that P4 is less informative than P2 but P4 will normally be algorithmically more complex than P2.

    Of course if the class of hypotheses is something like all MGs or MCFGs, then this is very low on both scales.

    (previous comment had a typo that meant it didn't make sense).
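    A small sanity check on the two scales (a sketch of my own; I'm assuming, which I didn't say above, that both priors are uniform over their grammars):

    from math import log2

    def entropy(prior):
        return -sum(p * log2(p) for p in prior.values() if p > 0)

    P2 = {"H1": 0.5, "H2": 0.5}
    P4 = {"H1": 0.25, "H2": 0.25, "H3": 0.25, "H4": 0.25}

    print(entropy(P2))  # 1.0 bit:  the more informative (stronger) prior
    print(entropy(P4))  # 2.0 bits: the less informative (weaker) prior

    # The algorithmic (Kolmogorov-style) intuition runs the other way: a program that
    # writes out four grammars is normally longer than one that writes out two.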

    Replies
    1. I dunno about the first intuition being valid. Someone once said to me, "I figured out why I don't like Bayesian methods. It's because you have to pick a hypothesis space." I told them why this was confused, but it sounds like the first intuition follows the same lines: "If I have to say what's possible, that's in conflict with domain-general learning; the less I have to specify about the hypothesis space the more domain-general I am." Just about everybody's got this intuition when they start out, until you realize that in order to study learning you have to be explicit no matter what; should we really be interpreting the algorithmic complexity of the specification of the hypothesis space?

    2. Well not in general, but in the context of a debate about nativism I think it is important, because it is a better measure of what has to have evolved than the information theoretic measures.

      So say we posit that a human baby has a Bayesian learner with a prior P built in; we want some way of measuring how complex P is, so we can get a handle on what Norbert would call Darwin's problem.
      It seems like the algorithmic complexity is a better measure for these purposes than the entropy of the prior?

      I don't know whether this actually makes sense technically.

    3. There might be a misleading aspect to Darwin's problem, in that, if the language faculty emerged by the glomming together of several preexisting faculties, then some of these might have brought with them complexities produced by adaptation to their previous environment. The signal for this would be recurrent grammatical complexities not explainable by functional or historical factors.

    4. I think there is more than one misleading aspect to Darwin's problem. The most obvious is the name: we really should call it for what it is: Chomsky's problem. It makes no sense to say "the language faculty emerged by glomming together several preexisting faculties" unless we have a fairly detailed description of the language faculty. To the best of my knowledge no such description exists anymore. Further, the most vigorous defenders of Chomsky's story [e.g., on this blog Norbert our kind host], insist that the evolution of LF happened in one [or maybe more than one] miracle mutation - so no glomming together but something arose de novo in its full glory [the part you call above not explainable by functional or historical factors - because for those we need no miracles]. That part is what gives R the edge over E in Norbert's story [and without it there is not much edge for R left].

      For this reason I agree with Alex that all we can do at the moment is 'posit' things that are at least not impossible [like infinity generators required in some stories]. But I also share Alex's concern about whether it makes sense technically. I think any R who wants to be taken seriously has to come up with a fairly detailed story about what IS part of the LF. Only then can we begin to worry about Darwin's problem. Or about what kind of hypothesis space is required and whether Bayesian approaches are helpful...

    5. Alex:
      "It seems like the algorithmic complexity is a better measure for these purposes than the entropy of the prior?

      I don't know whether this actually makes sense technically"

      OK, yes, I have the same feeling - on both counts...

  5. It makes perfectly good sense, it's just kind of vague and speculative. The E position is also rather speculative in the absence of E-learners or (weaker) nonconstructive specifications of descriptively adequate grammars from realistic types and amounts of data.

    The Kwiatkowski et al CCG learner is an interesting case: it is somewhat R-ish in that CCG imposes some restrictions on the nature of the form-meaning relationship, and it seems to learn an impressive amount of stuff (but how would it fare on Icelandic, Kayardild or Dinka?).

    Replies
    1. Ideally what we would have is an implemented computer program and large corpora of child directed speech in Kayardild and Martuthunira and Dyirbal and Yoruba and so on, all annotated with their meanings, and we would feed in the data and out would pop a descriptively adequate grammar; but we don't have that for either R-learners or E-learners.
      But I think for E-learners, namely distributional learners of the type that I have been working on, we have all of the parts that such a learner would be composed of and proofs that they work efficiently (in the computational sense) on classes of languages that seem to include Swiss German, and Kayardild and Old Georgian and so on. These are based on query-learning, but there are quite a few papers now showing how queries can be replaced with probabilistic data.
      Whereas for E-learners at the moment, there is not even a sketch of how the whole process is meant to work, unless you take the P and P route, which is the antithesis of the MP approach.

      (CCG is too weak to model Suffixaufnahme though it can of course do cross-serial dependencies in Swiss German.
      I don't think the CCG learner can learn non-CFGs though, as the parser is based on a CFG reduction I think, but I would have to check with Mark S).

    2. But is CCG E or R? It embodies a definite take on how languages are organized, which seems to me to have something to do with why the learner works.

    3. @ Alex: I think you meant R-learner in the below:

      Whereas for E-learners at the moment, there is not even a sketch of how the whole process is meant to work, unless you take the P and P route, which is the antithesis of the MP approach.

      @ Avery: you say: It makes perfectly good sense, it's just kind of vague and speculative.

      But that is exactly the problem, is it not: after 60+ years of intense research with progress being made all the time it should not be that vague any more. In fact 'it' was a lot more specific 30 years ago. At some point R theorists have to put their cards on the table and reveal WHAT is innate in terms that are not vague and speculative. Same for saying a certain learner is R-ish: it was admitted decades ago that all proposals assume some R-ishness. The kittens-and-rocks blank-slate defender does not exist outside Chomsky's [and McGilvray's] fantasy.

      What is needed are specific proposals that can be tested with child-language data and checked for biological realizability. So for example having "an implemented computer program and large corpora of child directed speech i... and out would pop a descriptively adequate grammar" may be a great accomplishment for either E or R theorists. But as long as said computer program could not 'run' on human wetware it tells us little about how kids acquire language.

      What is definitely not needed is any more dogmatism about one approach having to be right because the other has not accounted for X yet [especially if the dogmatist cannot account for X either in a way that's not vague and speculative]. I think this R against E approach that Norbert wants to hang on to is really not helpful at all [maybe it was 40 years ago but most people have moved on], and I think only researchers who are willing to have an open mind about how much R [or E] might be needed are likely to get somewhere...

    4. This comment has been removed by the author.

    5. Yes that's right -- I meant:

      "Whereas for R-learners at the moment, there is not even a sketch of how the whole process is meant to work, unless you take the P and P route, which is the antithesis of the MP approach."

  6. If your concern is that Bayesians might be "too E" and this troubles you, I think you should rest assured that they can be as "R" as you like (although in practice they may not be). Let me try to give a summary of my "big tent" Bayesian view. Bayesian statistics is more than happy to have richly structured hypotheses. As Charles Yang points out in the post that prompted this, his probabilistic parameter setting work has a Bayesian interpretation, and it certainly adheres strongly to traditional notions of syntax. Bayesian reasoning says just two things: that beliefs and data are treated as the same "kind" of things (statistical objects characterized by distributions); and that beliefs are updated in a particular way given evidence: in particular, such that the probability of a particular "posterior" belief is proportional to your prior belief in its correctness times the likelihood of the evidence assuming the particular belief. To the extent that you would want to claim that language learning (or any other kind of learning) is not "Bayesian", it would necessarily have to be a claim about one of these things being false. And, to the extent that observed behavior deviates from these ideals, the standard arguments with respect to "ideal learners" vs. "actual learners" are still worth keeping in mind.

    Finally, one might also like Bayesian reasoning as a strategy that has a certain friendliness to Darwin's problem. In decision theory, the Bayesian decision rule (making the minimum expected error decision) is an optimal strategy. To the extent that some other decision rule (or non-Bayesian inference procedure) would come to exist, it would be out-competed by Bayesians, all other things being equal.

  7. I think there is a more general objection, which is that these models are "learning" models rather than "acquisition" models. That is to say, Bayesian models use information rationally to select the appropriate grammar, or update the posterior probability, rather than having the grammar be triggered in some sub-rational brute causal way. Some people (Chomsky? Norbert? Paul Pietroski?... I don't know) have a very strong conviction that language is not "learned" in this sense, but rather "grows" in some other way,
    in a process analogous to the biological development of an organ etc etc. I guess you know the rhetoric.




    Replies
    1. That's a fair point, although I wouldn't immediately assume that Bayesian "dynamics" don't characterize growth or "acquisition" processes. But, to the extent that there are strongly rationalist "learning" models, I think Bayes would be in principle completely compatible with them (whether the Bayesian Hypothesis is right or wrong).

  8. In spite of Norbert's qualms, I think generativists ought to be enthusiastic about Bayes, because it means that there has to be a prior, which on current knowledge will either have to take the form of a finite collection of parameters or some kind of rule/constraint notation with something like an evaluation metric defined over it, answering the perennial question of typological-functionalists and indolent undergraduates: what's the point of putting the grammars into a definite formal framework?

    The fact that some Bayesians are pursuing very lean hypotheses about the content of UG should not be a worry, because if they're inadequate, this will become demonstrable in due course (I don't think it is yet).
