Gigerenzer identifies an important assumption made in
“rational” theories of cognition, of which Bayes is one variety. He calls it,
following Carnap, the principle of full information (see
here and here
for discussion). I’ll call it “Carnap’s Principle” (CP). CP is one of the
features that makes rational theories of cognition rational. How? Well, CP
requires taking all the relevant data
into account. Not some of it, not the stuff you might like, but all of it, even
the troublesome bits. Agents are rational, at least in an ideal Marr Level-1 sense, to the degree that they cleave to CP.[1]
Of course, this is not always doable, for the best may be unachievable, demanding
resources (memory, computational speed) that are biologically/cognitively
unavailable. Theories of constrained optimization accommodate this fact by considering
possible ways of approximating the Level-1 “best” solution given the actual
computational resources at hand. The spirit of CP (take account of all the relevant data) is respected by algorithms that take account of as much of the relevant information as is computationally feasible. The way this gets into Bayes stories is that first we try to figure
out what would be best given all the data and then superimpose algorithms on
this that respect the resource problems; algorithms being “better” the closer
they approximate the ideal without being too resource demanding. Put another
way, Carnap’s full information principle is part of the theory of competence,
and resource constraints are part of the theory of performance. Performance
approaches the ideal to the degree that it respects CP.[2]
Gigerenzer observes that constrained optimization approaches
that incorporate CP as an ideal support an interesting counterfactual: were there no cost to using all the
data, then using all the available data would
be best (‘best’ here means would result in better (best?) outcomes). Or, to
put this another way, ignoring information exacts a cost that would be
mitigated were all the information actually used. In a word, the rational thing to do is also the best thing to do were we but able to do
it, but, sadly, because of resource constraints we cannot do what’s rational, which
is why we seem not to be acting optimally when we behave (see here for one
exposition).
Gigerenzer challenges CP. He argues that there are times
when using all the available information
leads to worse outcomes than ignoring some of it does. One might describe this in a way that heightens the paradox: acting rationally can be counterproductive; studied ignorance can be empowering. On this view, there
are times when the “best” theory is one that ignores relevant data in making its calculations. And ‘best’ here means
that using all the data, even when doing so is NOT computationally onerous, systematically yields worse results than using only part of it. In other words, Gigerenzer's point is that there are situations where, abstracting from resource costs, it is more rational to use less than all of the data, if by 'rational' one means 'gets better results'.
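To make the shape of this claim concrete, here is a minimal toy sketch of the kind of comparison Gigerenzer has in mind (it is not his actual study; every cue, number, and parameter below is invented for illustration). Both learners see the same small training sample and computation is free; the only question is which decision rule predicts better on fresh data.

```python
# Toy sketch only: a "use everything" rule versus a one-cue rule,
# both fit to the same small sample. All parameters are invented.
import random

random.seed(0)

def make_item():
    # One fairly reliable cue plus several weak, noisy ones.
    label = random.choice([0, 1])
    strong = label if random.random() < 0.9 else 1 - label
    weak = [label if random.random() < 0.55 else 1 - label for _ in range(5)]
    return [strong] + weak, label

train = [make_item() for _ in range(20)]    # small sample, as in real learning
test = [make_item() for _ in range(2000)]   # held-out data

def validity(i):
    # Fraction of training items on which cue i matches the label.
    return sum(cues[i] == y for cues, y in train) / len(train)

vals = [validity(i) for i in range(6)]

def use_all_cues(cues):
    # "Full information": weight every cue by its estimated validity.
    score = sum((v - 0.5) * (1 if c == 1 else -1) for c, v in zip(cues, vals))
    return 1 if score > 0 else 0

def take_the_best(cues):
    # Myopic rule: consult only the single most valid cue, ignore the rest.
    return cues[max(range(6), key=lambda i: vals[i])]

for name, rule in [("all cues", use_all_cues), ("one cue", take_the_best)]:
    acc = sum(rule(c) == y for c, y in test) / len(test)
    print(f"{name}: {acc:.3f}")
```

The Gigerenzer-style point would then be that, with samples this small, the weights the full-information rule assigns to the weak cues are mostly noise, so consulting them can hurt out-of-sample accuracy even though consulting them costs nothing.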
Given this possibility, there are two kinds of situations to
consider which result in two different kinds of “failure” to achieve an
outcome: (i) failure resulting from not
using all the information and (ii)
failure resulting from actually using
all the relevant information. The world, it seems, is a complicated place, and the best strategy depends intimately on the circumstances.
It’s worth noting that CP is an add-on to Bayes. By this I mean that one can use Bayesian methods either over an informationally restricted domain or over one that includes all the data.
Bayesian methods are agnostic wrt this (though CP helps underwrite the rationality assumption that is part of the Bayesian methodological justification (the rational analysis meme)). The question
then is really one about the right characterization of the relevant data: which should be used and which ignored to get the
best results. Rational theories assume that all usable data should in fact be
used. Less is NEVER more! Gigerenzer disagrees.
Gigerenzer has offered several empirical examples that back
this reasoning up. What I want to consider here is whether there are any linguisticky examples. Here I offer for your delectation two that seem to
illustrate Gigerenzer’s observations.
I reviewed one a while ago when discussing Medina, Snedeker,
Trueswell & Gleitman’s model of word learning and a modest revision of this
model by Stevens, Yang, Trueswell and Gleitman (SYTG) (here,
here,
and here).
A key feature of these models is that they reject “cross situational learning,”
this being “the tabulation
of multiple, possibly all, word-meaning associations across learning instances.”
In other words, a key feature of these models is that they do far fewer calculations than they could have done. They thus exploit a very reduced subset of the relevant information
available. Usefully, SYTG compare the results of richer uses of cross situational learning with their model, which
makes use of a very restricted subset of the available information. The interesting
result is that the models that more closely approximate the CP ideal do worse than the one that is far more
myopic. In other words, less is more in this particular case, suggesting that
even were one able to compute all the relevant alternatives (though it should be noted that it's very computationally costly to do so), it would be ill-advised, for it leaves the learner worse off. The problem then is not that the computation is too hard to do (though it is that as well), but that doing it, even were it computationally cheap, offers worse outcomes than doing a more half-assed job.[3]
Let’s hear it for sloth!
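For readers who like the contrast spelled out, here is a deliberately crude sketch (not Medina et al.'s or SYTG's actual models, whose details differ; all names and parameters below are invented). The first learner honors the spirit of CP by tabulating every word-meaning co-occurrence in every scene; the second stores a single conjecture per word and consults nothing else.

```python
# Crude caricature of the two strategies; not the published models.
from collections import defaultdict
import random

random.seed(1)

class CrossSituational:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, word, scene_meanings):
        # Tabulate every word-meaning association in the scene.
        for m in scene_meanings:
            self.counts[word][m] += 1

    def guess(self, word):
        table = self.counts[word]
        return max(table, key=table.get) if table else None

class ProposeButVerify:
    def __init__(self):
        self.conjecture = {}

    def observe(self, word, scene_meanings):
        # Keep the current guess if the scene confirms it; otherwise
        # grab a fresh candidate at random and forget the old one.
        if self.conjecture.get(word) not in scene_meanings:
            self.conjecture[word] = random.choice(sorted(scene_meanings))

    def guess(self, word):
        return self.conjecture.get(word)

# Toy usage: each learning instance pairs a word with its true
# referent plus a few random distractor referents.
lexicon = {f"w{i}": f"m{i}" for i in range(20)}
meanings = list(lexicon.values())

cs, pbv = CrossSituational(), ProposeButVerify()
for _ in range(200):
    word = random.choice(list(lexicon))
    scene = {lexicon[word]} | set(random.sample(meanings, 3))
    cs.observe(word, scene)
    pbv.observe(word, scene)

for name, learner in [("cross-situational", cs), ("propose-but-verify", pbv)]:
    acc = sum(learner.guess(w) == m for w, m in lexicon.items()) / len(lexicon)
    print(name, round(acc, 2))
```

In this noise-free toy both learners will typically converge; SYTG's empirical point, as described above, is that under realistic referential uncertainty it is the full tabulation that degrades.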
I’ve recently read another illustration of this logic, this
time as pertains to learning word segmentation (here). The paper is by
Kasia Hitczenko (currently at UMD, yeah!!) and Gaja Jarosz (at Yale) (H&J). This was Kasia's UG thesis work (and yes, the young are more talented than we were when we were their age. Thank the lord that I don't have to compete for grad school admission now). At any rate, the paper compares two methods of word segmentation acquisition: ideal versus constrained learner models. The key is
that the former abstract from memory limitations that the latter incorporate.
H&J investigate two models of word segmentation
acquisition that incorporate the same probability model but differ in that one
is incremental (and hence more realistic) and consequently executes a more
limited set of computations. In particular, in the batch learner, the
estimation procedure is “calculated over the whole corpus, while in the
incremental algorithm, it is calculated over individual utterances” (3). The
former calculation is far more extensive in that it evaluates goodness of fit
over all the possible information in the corpus, unlike the incremental learner,
which does a less thorough set of comparisons. Surprisingly, the more limited
incremental learner does much better than the more rational batch learner.
Moreover, and this is the important thing that H&J show, what makes the
incremental learner better is its memory limitations (i.e. the fact that unlike
the batch learner it does not make all the comparisons it is possible to make).
H&J demonstrate this by adding two kinds of memory limitations to the batch
learner model and then showing that with these restrictions in place (forcing
the batch learner to do a less thorough job), the batch learner does as well as
the incremental learner with regard to segmentation. In other words, it appears
that the ideal learner does less well than a more myopic one, but not because of resource constraints.
Rather, as H&J say concerning the batch models: “providing this model with
access to all of the available information prevents it from segmenting
accurately.” This is a fact about the segmentation procedure not about resource
limitations.
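H&J's actual systems are full Bayesian segmentation models, well beyond a blog post, but the structural contrast they describe (estimation "calculated over the whole corpus" versus "over individual utterances") can be caricatured in a few lines. Everything below, including the toy corpus and the window parameter, is invented for illustration.

```python
# Schematic contrast only (not H&J's segmentation model): the same
# relative-frequency estimator applied in batch over a whole corpus
# versus incrementally, utterance by utterance, with a small memory.
from collections import Counter

corpus = [
    ["the", "dog", "barked"],
    ["the", "cat", "slept"],
    ["a", "dog", "slept"],
]

def batch_estimate(utterances):
    # "Ideal" learner: one estimate computed over all the data at once.
    counts = Counter(w for utt in utterances for w in utt)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def incremental_estimate(utterances, window=1):
    # Constrained learner: re-estimate after each utterance using
    # only the last `window` utterances it can still remember.
    memory, probs = [], {}
    for utt in utterances:
        memory = (memory + [utt])[-window:]
        counts = Counter(w for u in memory for w in u)
        total = sum(counts.values())
        probs = {w: c / total for w, c in counts.items()}
    return probs

print(batch_estimate(corpus))
print(incremental_estimate(corpus))
```

The window parameter is a crude analogue of the memory limitations H&J impose on the batch learner: it restricts how much of the data any single estimate can consult.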
This last point is important. It is common knowledge that
ideal learning is often not feasible. The calculations are just too
intractable. However, here, as in the earlier word learning models discussed
above, the problem is not computational cost. The result is that actually doing the fuller computation leaves the learner worse off than not doing it at all. In
other words, abstracting away from
computational cost, the more “rational” model, the one closer to endorsing CP,
does worse than the more limited one. Thus, this is another example, in the
linguistic domain, where less really is more.
This result is of interest not only to Bayesians, but to
those like me who’ve always liked thinking of acquisition problems in terms of
ideal speaker-hearers (ISH). ISHs have perfect, unboundedly capacious memories and are able to take all the data in at
once (instantaneous learning). What the results above seem to indicate is that
ISH assumptions might actually make acquisition harder, as disregarding some of the data may be required to acquire
a language at all. And this is interesting, for it suggests that for some
matters (e.g. when statistical calculations become important), the ISH
idealization might be misleading precisely because it incorporates Carnap’s Principle
as an ideal.
We all know that idealizations are literally speaking false
in that they abstract away from what we know to be many (possibly important)
factors. What Gigerenzer has pointed out is that they can be false in two ways,
one more benign than the other, IMO.
The first way is that they abstract away from
resource constraints that we know to be important. LADs do not have perfect
arbitrarily capacious memories. We know this. And thus real learners might fail
to realize the properties we ascribe to ideal learners for precisely this reason.
The utility of the idealization remains, however, as we assume that the closer
the real learner’s resources approach the ideal conditions, the closer the real
learner’s acquisition gets to the ideal limit case.
The second way to fail is to misunderstand that less can be more; in particular, that having resource limitations may be an important factor in making acquisition possible (or at least much easier). This is less intuitive, for it suggests that there is a resource curse: you can have too much of a good thing.
So, it seems that Gigerenzer’s observations have empirical
utility, at least in the domain of word acquisition. So next time you are
forgetful, remember there may be an upside. Given my current early-onset
proclivities, I sure hope so.
[1]
Your favorite neighborhood Bayesian makes this assumption, which is one of the
reasons that it is a species of rational analysis.
[2]
I should add here that this is not
the way I think that linguists use, or ought to use, the competence/performance
distinction. Competence is not some ideal target that performance tries to hit.
Rather, competence theories are theories of what a cognizer knows and
performance theories are accounts of how this knowledge is put to use. What one
knows is not in any reasonable sense a “target” that performance theories aim
to hit.
[3]
The way that SYTG argue for this is that they actually do the costly
computation and show that it is less successful than the one that eschews cross
situational learning. So, even though the CP compliant computations are indeed
costly, this is not the reason that
they are less successful in the word learning conditions explored. Of course,
that they are computationally expensive is another reason to reject them. But
it is an additional reason. It's the counterfactual that makes Gigerenzer's point interesting.
It seems to me that there are two meanings of "ideal learner" that we might want to distinguish. It looks like Hitczenko and Jarosz are using this word to mean "has no resource limitations"; in other contexts it is used to mean "performs the task in an optimal way". Clearly resource limitations typically impair the learner's ability to perform optimally, but the other direction isn't true - models without any resource limitations may still not perform "optimally", if by that you mean generalize well to new data. The extreme example is a learner that simply memorizes the data and doesn't generalize at all.
Essentially, if your learning algorithm performs better if you ignore some of the training data than if you remember all of it, I don't see how that algorithm could be ideal.
Not sure I understand what you are driving at. The Gigerenzer point, if I understand him, is that there are some circumstances where less really is more. He zeros in on Carnap's Principle as the assumption that denies this. It expresses the view that less is never more. We can massage this view to allow in what we might call Simon's Principle: there are costs to more, so these need to be taken into account. This leaves us with: less is never more, subject to computational costs. What is interesting is that there are cases where we can hold these costs constant and see whether doing more always leads to better measurable outcomes. Carnap's principle is then the thesis that this is impossible. Well, the cases I mentioned seem to indicate that the impossible happens, i.e. that less can do better than more. This I take to be an illustration of Gigerenzer's point. Now, your first comment seems to be making the same point in another way. Good. As for your second, I'm not sure what you intend. Is the suggestion terminological? BTW, I used the term "ideal" in the context of the "ideal speaker-hearer." This was used as an example of a learner not burdened by serious resource constraints. I don't think that the people I mentioned used the term. I am happy to drop it if that is what you are pointing to. So, I am confused or in total agreement. Enlighten me.
Hitczenko and Jarosz refer to their learner as an ideal learner. What I was saying is that I wouldn't consider an algorithm that learns better from less data to be an ideal learning algorithm. I haven't read either Carnap or Gigerenzer, so I can't really comment on that part.
The issue with the "ignoring some of the relevant information is beneficial" story is that there is no objective way of characterizing "all the relevant information" to begin with. Paying undue attention to stuff that is irrelevant to the problem at hand would be expected to lead to bad performance, so at least to me, the most plausible interpretation of all the "constrained-outperforms-unconstrained" results ought to be: "our characterization of 'all relevant data' is off; in fact, it includes misleading information (and it might very well miss relevant information); let's make the characterization better".
DeleteTake the standard Batch-Bayes-bashing - it's not that doing batch processing really hurts, it is that the model that is being assumed is insufficient, e.g. in that it assumes words to not depend on their surrounding context; and because of this, it mis-characterizes "all relevant information". If we make this characterization "less wrong", there's no "less is more" anymore, batch wins against incremental any day of the week and yields, at least as of now, the "best" results (now how to evaluate algorithms is a whole other can of worms, of course, though this is really orthogonal to the issue at hand).
This is not to say that there is nothing wrong with Batch processing as a psychological proposal about what humans do. But the batch algorithms were never proposed as hypotheses about these mechanisms. They are tools to help us identify a plausible _model_ that correctly characterizes "the relevant data".
I guess I just don't find the sticking-with-the-model-and-trying-out-algorithms-until-it-works-plus-"you just have to _not_ use all the relevant data"-approach very convincing. It strikes me as analogous to replacing a set of valid inference rules with some totally random assortment of heuristics because, surprise, starting from false premises even valid inference rules can lead to false conclusions; whereas this other set of random rules, at least this one time, from the exact same (false) premises, led to slightly less wrong conclusions.
Fair enough. However, I don't really think that the questions are wholly terminological. At any rate, as usual, Charles' recent post does a far better job than I could do in making some of the issues of interest salient.