Last week I had two Bayes moments: the first was a vigorous
discussion of the Jones and Love paper with computational colleagues with a CS
mindset and the second was a paean delivered by Stan Dehaene in his truly
excellent Baggett Lectures this year (here). These conversations were quite
different in interesting ways.
My computational colleagues, if I understood them correctly
(and this is a very open question), saw Bayes as providing a general formal
framework useful for the type of problems cognitive neuroscientists encounter.
Importantly, on this conception, Bayes is not an empirical hypothesis about how
minds/brains compute but an empirically neutral notation whose principal virtue
is that it allows you to state matters more precisely than you can in simple
English. How so? Well, minds/brains are complicated and having a formal
technology that can track this complexity in some kind of normal form (as Bayes
does) is useful. On this view, Bayes has all the virtues (and charms?) of
double entry bookkeeping.[1]
On this view, every problem is amenable to a Bayesian analysis of some kind so
there is no way for a Bayes approach as
such to be wrong or inapposite, though particular proposals can be better
or worse. In short, Bayes so considered is more akin to C++ (an empirically
neutral programming language) than to Newton’s mechanics (a description of real
world forces).
There is a contrasting view of Bayes that is prevalent in
some parts of the cog-neuro world. Here Bayes is taken to have empirical
content. In contrast to the first view, its methods can be inappropriate for a
given problem and a Bayes approach as
such can even be wrong if it can be shown that its design requirements are
not met in a given problem domain. On
this view, Bayes is understood to be a description of cog-neuro mechanisms, and to describe a system as
Bayesian is to attribute to that system certain distinctive properties.
These two conceptions could not be more different. And I
suspect that the move between these two conceptions has muddied the conceptual
landscape. Empirically minded cognitive
neuroscientists are attracted to the second conception for it commits empirical
hostages and makes serious claims (I review some of these below). CS types seem
attracted to the first conception precisely because it provides a general
notation for dealing with any computational problem and it does so by being
general and thus bereft of any interesting empirical content.[2]
Whichever perspective you adopt, it’s worth keeping them separate.
I just read a very good exposition of the second conception
of Bayes-as-mechanism by O’Reilly, Jbabdi and Behrens (OJB) (here).[3]
OJB is at pains (i) to show what the basic characteristics of a Bayes system
are, (ii) to illustrate successful cases where the properties so identified
gain explanatory purchase within cog-neuro and (iii) to show where the
identified properties are not useful in explaining what is going on. Step (iii)
illustrates that OJB takes Bayes to be making a serious empirical claim. Here are
some details, though I recommend that you read the paper in its entirety as it
offers a very good exposition of why neuroscientists have become interested in
Bayes, and it’s not just because of (or even mainly due to) its computational
explicitness.
OJB identifies the following characteristics of a Bayesian
system (BS):
1. BSs represent quantities in terms of probability density functions (PDFs). These
represent an observer’s uncertainty about a quantity (1169).
2. BSs integrate information using precision weighting (1170). This means that
sources of information are combined in a way that is sensitive to
their relative reliability (as measured probabilistically): “It is a core
feature of Bayesian systems that when sources of information are combined, they
are weighted by their relative reliability (1171).”
3. BSs integrate new information and prior information according to their relative
precisions. This is analogous to what occurs when several sources of new
information are combined (e.g. visual and haptic info). As OJB puts it: “the
combined estimate is partway between the current information and the prior,
with the exact position depending on the relative precision of the current
estimate and the prior (1171).”
4. BSs represent the values of all parameters in the model jointly, viz. “It is a
central characteristic of fully Bayesian models that they represent the full
state space (i.e. the full joint probability distribution across all
parameters) (1171).”
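To make points 2 and 3 concrete, here is a minimal sketch of precision weighting, assuming that each source of information can be summarized as a Gaussian with a mean and a variance (precision being the inverse of the variance). The function name and the numbers are my own illustrations, not anything taken from OJB.

```python
def combine_gaussians(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same quantity.

    Each estimate is weighted by its precision (1 / variance), so the
    more reliable source pulls the combined estimate toward itself.
    """
    precision_a = 1.0 / var_a
    precision_b = 1.0 / var_b
    combined_precision = precision_a + precision_b
    combined_mean = (precision_a * mean_a + precision_b * mean_b) / combined_precision
    combined_var = 1.0 / combined_precision
    return combined_mean, combined_var


# Hypothetical numbers: a reliable visual cue and a noisier haptic cue.
visual_mean, visual_var = 10.0, 1.0
haptic_mean, haptic_var = 14.0, 4.0

fused_mean, fused_var = combine_gaussians(
    visual_mean, visual_var, haptic_mean, haptic_var
)
print(fused_mean, fused_var)
```

With these made-up numbers the fused mean lands much closer to the visual cue than to the haptic one, and the fused variance is smaller than either input variance. Treating one of the two inputs as the prior gives the same arithmetic for point 3.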
In sum: a BS represents info as PDFs, does precision
weighted integration of old and new info, and fully represents and updates all
info in the state space. If this is what a BS is, then there are several
obvious ways that a given proposal can be non-BS. A useful feature of OJB is
that it contrasts each of these definitional features with a non-BS
alternative. Here are some ways that a proposed system can be non-BS.
· It represents quantities as exact (contra 1). For example, in the MSTG paper here,
learners represented their word knowledge with no apparent measure of
uncertainty in the estimate.
· It combines info without sensitivity to its reliability (contra 2). Thus, e.g.,
in the MSTG paper information is not combined probabilistically by
considering the weighting of the prior and input. Rather, contrary info leads
one to drop the old info completely and arbitrarily choose a new candidate.
· It uses a truncated parameter space or does not compute values for all
alternatives in the space (contra 4). Again, in the MSTG paper the
relevant word-meaning alternatives are not updated at all as only one candidate
at a time is attended to.
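The contrast can be caricatured in a few lines. What follows is a toy sketch, not the model in the MSTG paper: the function names, the candidate meanings and the likelihood values are all assumptions made up for illustration. One learner keeps a probability for every candidate meaning and reweights all of them; the other holds a single guess with no measure of uncertainty and replaces it arbitrarily when it is contradicted.

```python
import random


def bayesian_update(beliefs, likelihoods):
    """Reweight every candidate meaning by its likelihood, then renormalize."""
    unnormalized = {m: beliefs[m] * likelihoods[m] for m in beliefs}
    total = sum(unnormalized.values())
    return {m: p / total for m, p in unnormalized.items()}


def single_hypothesis_update(current_guess, meanings_in_scene, rng):
    """Keep the one guess if the scene is consistent with it.

    Otherwise drop it completely and pick a new candidate at random.
    No other alternatives are tracked or updated.
    """
    if current_guess in meanings_in_scene:
        return current_guess
    return rng.choice(sorted(meanings_in_scene))


# Hypothetical scene: the word is heard while a dog and a ball are visible.
prior = {"dog": 1 / 3, "cat": 1 / 3, "ball": 1 / 3}
likelihoods = {"dog": 0.45, "cat": 0.10, "ball": 0.45}
posterior = bayesian_update(prior, likelihoods)
print(posterior)

rng = random.Random(0)
print(single_hypothesis_update("cat", {"dog", "ball"}, rng))
```

The first learner ends up with graded beliefs over all three meanings. The second ends up with exactly one meaning and no record of how confident it should be, which is the sense in which it is non-BS on all three counts.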
The empirical Bayes question then is pretty straightforward
conceptually: to what degree do various systems act in a BS manner and when a
system deviates along one of the three dimensions of interest, how serious a
deviation is it? Once again, OJB offer useful illustrations. For example, OJB
notes that multi-sensory integration looks very BSish. It represents incoming
info PDFly and does precision weighted integration of the various inputs. Well,
almost. It seems that some modalities might be weighted more than “is optimal”
(1172). However, by and large, the BS model captures the central features of how this
works. Thus, in this case, the BS model is a reasonable description of the
relevant mechanism.
There are other cases where the BS idealization is far less
successful. For example, it is well known that “adding parameters to a model
(more dimensions to the model) increases the size of the state space, and the
computing power required to represent and update it, exponentially (1171).”
Apparently problems arise even when there are only “a handful of dimensions of
state spaces” (1175). Therefore, in many cases, it seems that behavior is
better described by “semi-Bayesian models,” (viz. with truncated state spaces)
or “non-Bayesian models” (viz. in which some parameters are updated and some
ignored) (1175). Or models in which
“variance-blind heuristics” substitute for precision weighted integration or
“rather than optimizing learning by integrating information over several trials,
participants seem to use only one previous exemplar of each category to
determine its ‘mean’ (1175).”
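The exponential blow-up is easy to see with a back-of-the-envelope count. This sketch assumes the joint distribution is stored on a grid with a fixed number of bins per parameter; the choice of 10 bins and the function name are my own assumptions, not OJB's.

```python
def grid_cells(num_params, bins_per_param=10):
    """Number of cells needed to store a full joint distribution on a grid."""
    return bins_per_param ** num_params


# Each added parameter multiplies the number of cells by bins_per_param.
for num_params in range(1, 7):
    print(num_params, grid_cells(num_params))
```

On these assumptions, six parameters already require a million cells, each of which has to be updated on every new observation. That is the pressure that makes truncated state spaces attractive.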
OJB describe various other scenarios of interest all with
the same aim: to show how to take Bayes seriously as a substantive piece of
cog-neuro science. It is precisely because not everything is Bayesian that, by arguing that a mechanism is Bayes, one might get some
explanatory insight from the classification. OJB take Bayes to be a useful description
for some class of mechanisms, ones
with the three basic characteristics noted above: PDF representations,
precision weighted integration and fully jointly specified state space.
OJB points out one further positive feature of taking Bayes
in this way: you can start looking for neural mechanisms that execute these
functions, e.g. how neurons or populations of neurons might allow for PDF-like
representations and their integrations. In other words, a mechanistic
interpretation of Bayes leads to an understandable research program, one with
real potential empirical reach.
Let me end here. I understand the OJB version of Bayes. I am
not sure how much it describes linguistic phenomena, but that is an empirical
question, and not one that I will be well placed to adjudicate. However, it is
understandable. What is less understandable are versions that do not treat
Bayes as a hypothesis about mental/neural mechanisms. If Bayes is not this, then why should we care? Indeed,
what possible reason can there be for not taking Bayes in this mechanistic
way? Is it the fear that it might be
wrong so construed? Is being wrong so
bad? Isn’t it the aim of cog-neuro to
develop and examine theories that could be wrong? So, my question to
non-mechanistic Bayesians: what’s the value added?
[1] This sounds more dismissive than it should perhaps. The invention of double
entry bookkeeping was a real big deal and if Bayes serves a similar function,
then it is nothing to sneeze at. However, if this is what practitioners take
its main contribution to be, they should let us know in so many words.
[3] The paper (How can a Bayesian approach
inform neuroscience) is behind a paywall. Sorry. If you are affiliated with
a university you should be able to get to it pretty easily as it is a Wiley
publication.
One huge attraction for the "methodological Bayes" stance is that it provides a principled way of studying how certain assumptions interact with certain inputs. As such, Bayesian modeling is an obvious tool to address questions of what can and cannot, in principle, be acquired from data, by a learner that embodies certain well-specified inductive biases. I don't know whether this is anything like double entry bookkeeping, but it certainly isn't something to sneeze at.
I'm happy to agree that there is not as much conceptual clarity as one would hope for. But that's hardly a feature that distinguishes Bayesian computational modelling from theoretical syntax.
[corrected typo]
I think I agreed in note 1. But if this is the appeal we should be told and it's interesting to see there is another point of view. Last, if this is the right conception we should dump all the talk of Marr. Agreed?
I think the methodological Bayes stance fits very nicely into Marr's levels. A Bayesian model specifies how a learner analyzes structures into parts. For example, the likelihood function for a dependency tree may compute the probability of the tree in terms of the probabilities of individual arcs between parts of speech, or it may compute the probability of the tree in terms of larger subtrees, or arcs between words and parts of speech, and so on. This is the computational level analysis, and says something like: if this is what our reusable pieces look like (e.g. tree substitution grammar elementary trees), and this is how much we prefer each piece before seeing data (we prefer smaller elementary trees), and this is our dataset that gives us the values of some of those variables (e.g. all the words and sentence boundaries), then here is the probability, for each location, that each piece was used. There is a diverse range of algorithms for actually computing these probabilities, or even just using them (in practice, as Ben keeps reminding me, researchers usually just want to find the most likely composition of pieces, and don't care what the actual probabilities are).
Given a model specification, there are many different algorithms for performing inference in the model specification. Some of those algorithms make “soft” choices, combining partial guesses about unobserved structure. Other algorithms will make hard choices, committing totally to a structure, but make these hard choices in proportion to the posterior probability over the long run of making choices. The choice of algorithm is the algorithmic level analysis. When OJB say that “it is a central characteristic of fully Bayesian models that they represent the full state space,” they must mean that the abstract (computational-level) model represents the full state space, not that the algorithm for performing inference in the model represents the full state space. The whole point of sampling and variational algorithms is that the full state space is too big to represent. Indeed, non-parametric models have an infinite state space that cannot be fully represented in finite memory.