Tuesday, April 21, 2015

Bayes and Gigerenzer

Gigerenzer identifies an important assumption made in “rational” theories of cognition, of which Bayes is one variety. He calls it, following Carnap, the principle of full information (see here and here for discussion). I’ll call it “Carnap’s Principle” (CP). CP is one of the features that makes rational theories of cognition rational. How? Well, CP requires taking all the relevant data into account. Not some of it, not the stuff you might like, but all of it, even the troublesome bits. Agents are rational, at least in the ideal, Marr Level-1 sense, to the degree that they cleave to CP.[1] Of course, this is not always doable, for the best may be unachievable, demanding resources (memory, computational speed) that are biologically/cognitively unavailable. Theories of constrained optimization accommodate this fact by considering possible ways of approximating the Level-1 “best” solution given the actual computational resources at hand. The spirit of CP (take account of all the relevant data) is respected in algorithms that take account of all the relevant information it is computationally possible to take account of. The way this gets into Bayes stories is that we first figure out what would be best given all the data and then superimpose algorithms that respect the resource problems, an algorithm being “better” the closer it approximates the ideal without being too resource demanding. Put another way, Carnap’s full information principle is part of the theory of competence, and resource constraints are part of the theory of performance. Performance approaches the ideal to the degree that it respects CP.[2]

Gigerenzer observes that constrained optimization approaches that incorporate CP as an ideal support an interesting counterfactual: were there no cost to using all the data, then using all the available data would be best (‘best’ here meaning that it would result in better (best?) outcomes). Or, to put this another way, ignoring information exacts a cost that would be mitigated were all the information actually used. In a word, the rational thing to do is also the best thing to do were we but able to do it, but, sadly, because of resource constraints we cannot do what’s rational, which is why we seem not to act optimally when we actually behave (see here for one exposition).

Gigerenzer challenges CP. He argues that there are times when using all the available information leads to worse outcomes than ignoring some of it does. One might describe this in a way that heightens the paradox as follows: acting rationally can be counter-productive, or studied ignorance can be empowering. On this view, there are times when the “best” theory is one that ignores relevant data in making its calculations. And ‘best’ here means that using all the data, even when doing so is NOT computationally onerous, systematically yields worse results than using only part of it. In other words, Gigerenzer’s point is that there are situations where, abstracting from resource costs, it is more rational to use less than all of the data, if by ‘rational’ one means getting better results.
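To give the flavor in code (this is my own toy illustration, not one of Gigerenzer’s cases, and every number and variable in it is invented for the purpose), here is a minimal Python sketch of a less-is-more effect that holds even when computation is free: a regression learner that uses every available predictor, including one that is pure noise, versus a learner that simply ignores that column. With small training samples the “ignorant” learner tends to generalize better on held-out data, not because the full model is costlier to fit, but because the extra, irrelevant information adds estimation variance.

```python
# Toy less-is-more sketch (not from Gigerenzer): ignoring an irrelevant
# predictor can improve out-of-sample accuracy even though using it costs
# nothing computationally.
import numpy as np

rng = np.random.default_rng(0)

def avg_test_error(use_all_columns, n_train=15, n_test=1000, trials=500):
    errs = []
    for _ in range(trials):
        # Column 0 is genuinely predictive; column 1 is pure noise.
        X_tr = rng.normal(size=(n_train, 2))
        X_te = rng.normal(size=(n_test, 2))
        y_tr = 2.0 * X_tr[:, 0] + rng.normal(size=n_train)
        y_te = 2.0 * X_te[:, 0] + rng.normal(size=n_test)
        cols = [0, 1] if use_all_columns else [0]  # all the data vs. studied ignorance
        w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
        errs.append(np.mean((X_te[:, cols] @ w - y_te) ** 2))
    return np.mean(errs)

print("mean test error, using every predictor:", avg_test_error(True))
print("mean test error, ignoring the noise   :", avg_test_error(False))
```

The gap is modest but systematic, and it has nothing to do with resource limits: both learners pay essentially the same computational price.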

Given this possibility, there are two kinds of situations to consider, which result in two different kinds of “failure” to achieve an outcome: (i) failure resulting from not using all the information and (ii) failure resulting from actually using all the relevant information. The world, it seems, is a complicated place, and the best strategy intimately depends on the circumstances.

It’s worth noting that CP is an add-on to Bayes. By this I mean that one can use Bayesian methods either over an informationally restricted domain or over one that includes all the data. Bayesian methods are agnostic wrt this (though CP helps underwrite the rationality assumption that is part of Bayes’ methodological justification (the rational analysis meme)). The question then is really one about the right characterization of the relevant data: which should be used and which ignored to get the best results. Rational theories assume that all usable data should in fact be used. Less is NEVER more! Gigerenzer disagrees.

Gigerenzer has offered several empirical examples that back this reasoning up. What I want to consider here is whether there are any linguisticky examples. Here I offer for your delectation two that seem to illustrate Gigerenzer’s observations.

I reviewed one a while ago when discussing Medina, Snedeker, Trueswell & Gleitman’s model of word learning and a modest revision of this model by Stevens, Yang, Trueswell and Gleitman (SYTG) (here, here, and here). A key feature of these models is that they reject “cross situational learning,” this being “the tabulation of multiple, possibly all, word-meaning associations across learning instances.” In other words, a key feature of these models is that they did very many fewer calculations than they could have done. They thus exploited a very reduced subset of the relevant information available. Usefully, SYTG compares the results of richer uses of cross situational learning with their model, which makes use of a very restricted subset of the available information. The interesting result is that the models that more closely approximate the CP ideal do worse than the one that is far more myopic. In other words, less is more in this particular case, suggesting that even were one able to compute all the relevant alternatives (though it should be noted that it’s very computationally costly to do so), it would be ill-advised, for it leaves the learner worse off. The problem then is not that the computation is too hard to do (though it is that as well), but that doing it, even were it computationally cheap, offers worse outcomes than doing a more half-assed job.[3] Let’s hear it for sloth!
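For those who like the contrast spelled out, here is a deliberately schematic Python sketch of the two styles of learner. It is not SYTG’s actual model; the data structures and update rule below are simplified stand-ins meant only to show what “tabulate everything” versus “keep one guess and verify it” amounts to.

```python
# Schematic contrast between a cross-situational tabulator and a
# single-hypothesis ("keep one guess and verify it") learner.
# Stand-in code, not SYTG's implementation.
from collections import defaultdict
import random

def cross_situational(instances):
    # CP-style: tabulate every word-meaning co-occurrence across all instances,
    # then pick the most frequent pairing for each word.
    counts = defaultdict(lambda: defaultdict(int))
    for words, referents in instances:
        for w in words:
            for r in referents:
                counts[w][r] += 1
    return {w: max(refs, key=refs.get) for w, refs in counts.items()}

def single_hypothesis(instances):
    # Myopic: keep one conjecture per word; check it against the current
    # scene and resample only when it fails.
    guess = {}
    for words, referents in instances:
        for w in words:
            if w not in guess or guess[w] not in referents:
                guess[w] = random.choice(sorted(referents))
    return guess

# e.g. instances = [({"dog", "ball"}, {"DOG", "BALL"}),
#                   ({"dog", "cat"}, {"DOG", "CAT"})]
```

The myopic learner does strictly less bookkeeping; the point reviewed above is that, on SYTG’s data, this is not merely cheaper but more accurate.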

I’ve recently read another illustration of this logic, this time as pertains to learning word segmentation (here). The paper is by Kasia Hitczenko (currently at UMD, yeah!!) and Gaja Jarosz (at Yale) (H&J). This was Kasia’s UG thesis work (and yes, the young are more talented than we were when we were their age. Thank the lord that I don’t have to compete for grad school admission now). At any rate, the paper compares two methods of word segmentation acquisition: ideal versus constrained learner models. The key difference is that the former abstract from memory limitations that the latter incorporate.

H&J investigate two models of word segmentation acquisition that incorporate the same probability model but differ in that one is incremental (and hence more realistic) and consequently executes a more limited set of computations. In particular, in the batch learner the estimation procedure is “calculated over the whole corpus, while in the incremental algorithm, it is calculated over individual utterances” (3). The former calculation is far more extensive in that it evaluates goodness of fit over all the available information in the corpus, unlike the incremental learner, which does a less thorough set of comparisons. Surprisingly, the more limited incremental learner does much better than the more rational batch learner. Moreover, and this is the important thing that H&J show, what makes the incremental learner better is its memory limitations (i.e. the fact that, unlike the batch learner, it does not make all the comparisons it is possible to make). H&J demonstrate this by adding two kinds of memory limitations to the batch learner model and then showing that with these restrictions in place (forcing the batch learner to do a less thorough job), the batch learner does as well as the incremental learner with regard to segmentation. In other words, it appears that the ideal learner does less well than a more myopic one, but not because of resource constraints. Rather, as H&J say concerning the batch models: “providing this model with access to all of the available information prevents it from segmenting accurately.” This is a fact about the segmentation procedure, not about resource limitations.
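Here too a rough sketch may help locate where the batch/incremental difference lives. The code below is emphatically not H&J’s segmenter (their models are proper Bayesian lexical models); the scoring function is a crude invention of mine. The point is only structural: both learners use the same statistics, but the incremental learner commits to a segmentation utterance by utterance using only the counts accumulated so far, while the batch learner repeatedly re-fits every utterance against counts drawn from the whole corpus.

```python
# Structural sketch of incremental vs. batch segmentation learning.
# The probability model is a crude stand-in, not H&J's.
from collections import Counter

def segmentations(utt):
    # Enumerate every way of cutting a short utterance into words (toy-sized).
    for i in range(1, len(utt) + 1):
        head, rest = utt[:i], utt[i:]
        if not rest:
            yield [head]
        else:
            for tail in segmentations(rest):
                yield [head] + tail

def score(seg, lexicon, alpha=1.0):
    # Smoothed unigram score with a length penalty for unseen words.
    total = sum(lexicon.values()) + alpha
    p = 1.0
    for w in seg:
        p *= (lexicon[w] + alpha / 2 ** len(w)) / total
    return p

def incremental_learner(corpus):
    lexicon = Counter()
    for utt in corpus:  # one pass; each decision sees only the counts so far
        best = max(segmentations(utt), key=lambda s: score(s, lexicon))
        lexicon.update(best)
    return lexicon

def batch_learner(corpus, sweeps=5):
    lexicon = Counter(corpus)  # start by treating whole utterances as words
    for _ in range(sweeps):  # every decision sees counts from the entire corpus
        new_lexicon = Counter()
        for utt in corpus:
            best = max(segmentations(utt), key=lambda s: score(s, lexicon))
            new_lexicon.update(best)
        lexicon = new_lexicon
    return lexicon

# e.g. incremental_learner(["thedog", "thecat", "adogbarks"])
```

H&J’s point, transposed onto this sketch, is that the batch learner’s wider view is not just harder to implement realistically; given the model, it is what drags segmentation accuracy down.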

This last point is important. It is common knowledge that ideal learning is often not feasible; the calculations are simply intractable. However, here, as in the earlier word learning models discussed above, the problem is not computational cost. The result is that actually doing the fuller computation leaves the learner worse off than not doing it at all. In other words, abstracting away from computational cost, the more “rational” model, the one closer to endorsing CP, does worse than the more limited one. Thus, this is another example, in the linguistic domain, where less really is more.

This result is of interest not only to Bayesians, but to those like me who’ve always liked thinking of acquisition problems in terms of ideal speaker-hearers (ISHs). ISHs have perfect, unboundedly capacious memories and are able to take all the data in at once (instantaneous learning). What the results above seem to indicate is that ISH assumptions might actually make acquisition harder, as disregarding some of the data may be required to acquire a language at all. And this is interesting, for it suggests that for some matters (e.g. when statistical calculations become important), the ISH idealization might be misleading precisely because it incorporates Carnap’s Principle as an ideal.

We all know that idealizations are, literally speaking, false in that they abstract away from what we know to be many (possibly important) factors. What Gigerenzer has pointed out is that they can be false in two ways, one more benign than the other, IMO.

The first way is that they abstract away from resource constraints that we know to be important. LADs do not have perfect, arbitrarily capacious memories. We know this. And thus real learners might fail to realize the properties we ascribe to ideal learners for precisely this reason. The utility of the idealization remains, however, as we assume that the closer the real learner’s resources approach the ideal conditions, the closer the real learner’s acquisition gets to the ideal limit case.

The second way to fail is to misunderstand that less can be more; in particular, that having resource limitations may be an important factor in making acquisition possible (or at least much easier). This is less intuitive, for it suggests that there is a resource curse: you can have too much of a good thing.

So, it seems that Gigerenzer’s observations have empirical utility, at least in the domain of word acquisition. So next time you are forgetful, remember there may be an upside. Given my current early-onset proclivities, I sure hope so.


[1] Your favorite neighborhood Bayesian makes this assumption, which is one of the reasons that it is a species of rational analysis.
[2] I should add here that this is not the way I think that linguists use, or ought to use, the competence/performance distinction. Competence is not some ideal target that performance tries to hit. Rather, competence theories are theories of what a cognizer knows and performance theories are accounts of how this knowledge is put to use. What one knows is not in any reasonable sense a “target” that performance theories aim to hit.
[3] The way that SYTG argues for this is that they actually do the costly computation and show that it is less successful than the one that eschews cross situational learning. So, even though the CP-compliant computations are indeed costly, this is not the reason that they are less successful in the word learning conditions explored. Of course, that they are computationally expensive is another reason to reject them. But it is an additional reason. It’s the counterfactual that is Gigerenzer-interesting.

Comments:

  1. It seems to me that there are two meanings of "ideal learner" that we might want to distinguish. It looks like Hitczenko and Jarosz are using this word to mean "has no resource limitations"; in other contexts it is used to mean "performs the task in an optimal way". Clearly resource limitations typically impair the learner's ability to perform optimally, but the other direction isn't true - models without any resource limitations may still not perform "optimally", if by that you mean generalize well to new data. The extreme example is a learner that simply memorizes the data and doesn't generalize at all.

  2. Essentially, if your learning algorithm performs better if you ignore some of the training data than if you remember all of it, I don't see how that algorithm could be ideal.

    Replies
    1. Not sure I understand what you are driving at. The Gigerenzer point, if I understand him, is that there are some circumstances where less really is more. He zeros in on Carnap's Principle as the assumption that denies this. It expresses the view that less is never more. We can massage this view to allow in what we might call Simon's Principle: there are costs to more, so these need to be taken into account. This leaves us with: less is never more, subject to computational costs. What is interesting is that there are cases where we can hold these costs constant and see whether doing more always leads to better measurable outcomes. Carnap's Principle is then the thesis that this is impossible. Well, the cases I mentioned seem to indicate that the impossible happens, i.e. that less can do better than more. This I take to be an illustration of Gigerenzer's point. Now, your first comment seems to be making the same point in another way. Good. As for your second, I'm not sure what you intend. Is the suggestion terminological? BTW, I used the term "ideal" in the context of the "ideal speaker-hearer." This was used as an example of a learner not burdened by serious resource constraints. I don't think that the people I mentioned used the term. I am happy to drop it if that is what you are pointing to. So, I am confused or in total agreement. Enlighten me.

    2. Hitczenko and Jarosz refer to their learner as an ideal learner. What I was saying is that I wouldn't consider an algorithm that learns better from less data to be an ideal learning algorithm. I haven't read either Carnap or Gigerenzer, so I can't really comment on that part.

    3. The issue with the "ignoring some of the relevant information is beneficial"-story is that there is no objective way of characterizing "all the relevant information" to begin with. Paying undue attention to stuff that is irrelevant to the problem at hand would be expected to lead to bad performance, so at least to me, the most plausible interpretation of all the "constrained-outperforms-unconstrained"-learner results ought to be "our characterization of 'all relevant data' is off; in fact, it includes misleading information (and it might very well miss relevant information); let's make the characterization better".

      Take the standard Batch-Bayes-bashing - it's not that doing batch processing really hurts, it is that the model that is being assumed is insufficient, e.g. in that it assumes words to not depend on their surrounding context; and because of this, it mis-characterizes "all relevant information". If we make this characterization "less wrong", there's no "less is more" anymore, batch wins against incremental any day of the week and yields, at least as of now, the "best" results (now how to evaluate algorithms is a whole other can of worms, of course, though this is really orthogonal to the issue at hand).

      This is not to say that there is nothing wrong with Batch processing as a psychological proposal about what humans do. But the batch algorithms were never proposed as hypotheses about these mechanisms. They are tools to help us identify a plausible _model_ that correctly characterizes "the relevant data".

      I guess I just don't find the sticking-with-the-model-and-trying-out-algorithms-until-it-works-plus-"you just have to _not_ use all the relevant data"-approach very convincing. It strikes me as analogous to replacing a set of valid inference rules with some totally random assortment of heuristics because, surprise, starting from false premises even valid inference rules can lead to false conclusions; whereas this other set of random rules, at least this one time, from the exact same (false) premises, led to slightly less wrong conclusions.

    4. Fair enough. However, I don't really think that the questions are merely terminological. At any rate, as usual, Charles' recent post does a far better job than I could do in making some of the issues of interest salient.
