Tuesday, November 19, 2013

Bayesian claims?

Last week I had two Bayes moments: the first was a vigorous discussion of the Jones and Love paper with computational colleagues with a CS mindset and the second was a paean delivered by Stan Dehaene in his truly excellent Baggett Lectures this year (here). These conversations were quite different in interesting ways.

My computational colleagues, if I understood them correctly (and this is a very open question) saw Bayes as providing a general formal framework useful to the type of problems cognitive neuroscientists encounter. Importantly, on this conception, Bayes is not an empirical hypothesis about how minds/brains compute but an empirically neutral notation whose principal virtue is that it allows you to state matters more precisely than you can in simple English. How so? Well, minds/brains are complicated and having a formal technology that can track this complexity in some kind of normal form (as Bayes does) is useful. On this view, Bayes has all the virtues (and charms?) of double entry bookkeeping.[1] On this view, every problem is amenable to a Bayesian analysis of some kind so there is no way for a Bayes approach as such to be wrong or inapposite, though particular proposals can be better or worse. In short, Bayes so considered is more akin to C++ (an empirically neutral programming language) than to Newton’s mechanics (a description of real world forces).

There is a contrasting view of Bayes that is prevalent in some parts of the cog-neuro world. Here Bayes is taken to have empirical content. In contrast to the first view, its methods can be inappropriate for a given problem and a Bayes approach as such can even be wrong if it can be shown that its design requirements are not met in a given problem domain. On this view, Bayes is understood to be a description of cog-neuro mechanisms, and to describe a system as Bayesian is to attribute to that system certain distinctive properties.

These two conceptions could not be more different. And I suspect that the move between these two conceptions has muddied the conceptual landscape. Empirically minded cognitive neuroscientists are attracted to the second conception for it offers empirical hostages and makes serious claims (I review some of these below). CS types seem attracted to the first conception precisely because it provides a general notation for dealing with any computational problem and it does so by being general and thus bereft of any interesting empirical content.[2] Whichever perspective you adopt, it’s worth keeping them separate.

I just read a very good exposition of the second conception of Bayes-as-mechanism by O’Reilly, Jbabdi and Behrens (OJB) (here).[3] OJB is at pains (i) to show what the basic characteristics of a Bayes system are, (ii) to illustrate successful cases where the properties so identified gain explanatory purchase within cog-neuro and (iii) to show where the identified properties are not useful in explaining what is going on. Step (iii) illustrates that OJB takes Bayes to be making a serious empirical claim. Here are some details, though I recommend that you read the paper in its entirety as it offers a very good exposition of why neuroscientists have become interested in Bayes, and it’s not just because of (or even mainly due to) its computational explicitness.

OJB identifies the following characteristics of a Bayesian system (BS):

1. BSs represent quantities in terms of probability density functions (PDFs). These represent an observer’s uncertainty about a quantity (1169).
2. BSs integrate information using precision weighting (1170). This means that information sources are combined in a manner sensitive to their relative reliability (as measured probabilistically): “It is a core feature of Bayesian systems that when sources of information are combined, they are weighted by their relative reliability (1171).”
3. BSs integrate new information and prior information according to their relative precisions. This is analogous to what occurs when several sources of new information are combined (e.g. visual and haptic info). As OJB puts it: “the combined estimate is partway between the current information and the prior, with the exact position depending on the relative precision of the current estimate and the prior (1171).”
4. BSs represent the values of all parameters in the model jointly, viz. “It is a central characteristic of fully Bayesian models that they represent the full state space (i.e. the full joint probability distribution across all parameters) (1171).”
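Characteristics 2 and 3 have a concrete arithmetic core. Assuming Gaussian PDFs (my simplification for illustration, not OJB’s notation), precision-weighted integration is just an average weighted by inverse variances:

```python
# Precision-weighted integration of two Gaussian estimates
# (e.g. a prior and a new observation, or visual and haptic cues).
# A minimal sketch assuming Gaussian PDFs; precision = 1 / variance.

def combine(mean1, var1, mean2, var2):
    """Combine two Gaussian estimates, weighting each by its precision."""
    prec1, prec2 = 1.0 / var1, 1.0 / var2
    combined_var = 1.0 / (prec1 + prec2)
    combined_mean = (mean1 * prec1 + mean2 * prec2) * combined_var
    return combined_mean, combined_var

# A reliable cue (low variance) dominates an unreliable one:
mean, var = combine(0.0, 1.0, 10.0, 4.0)
print(mean, var)  # 2.0, 0.8 -- the estimate sits closer to the precise cue
```

Note how the combined estimate is "partway between" the two inputs, exactly as the OJB quote in point 3 describes, and how the combined variance is smaller than either input's: integration reduces uncertainty.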

In sum: a BS represents info as PDFs, does precision weighted integration of old and new info, and fully represents and updates all info in the state space. If this is what a BS is, then there are several obvious ways that a given proposal can be non-BS. A useful feature of OJB is that it contrasts each of these definitional features with a non-BS alternative. Here are some ways that a proposed system can be non-BS.

· It represents quantities as exact (contra 1). For example, in the MSTG paper here, learners represented their word knowledge with no apparent measure of uncertainty in the estimate.
· It combines info without sensitivity to its reliability (contra 2). Thus, e.g. in the MSTG paper information is not combined probabilistically by weighting the prior and the input. Rather, contrary info leads one to drop the old info completely and arbitrarily choose a new candidate.
· It uses a truncated parameter space or does not compute values for all alternatives in the space (contra 4). Again, in the MSTG paper the relevant word-meaning alternatives are not updated at all, as only one candidate at a time is attended to.

The empirical Bayes question then is pretty straightforward conceptually: to what degree do various systems act in a BS manner and when a system deviates along one of the three dimensions of interest, how serious a deviation is it? Once again, OJB offer useful illustrations. For example, OJB notes that multi-sensory integration looks very BSish. It represents incoming info PDFly and does precision weighted integration of the various inputs. Well almost. It seems that some modalities might be weighted more than “is optimal” (1172). However, by and large, the BS model captures the central features of how this works. Thus, in this case, the BS model is a reasonable description of the relevant mechanism.

There are other cases where the BS idealization is far less successful. For example, it is well known that “adding parameters to a model (more dimensions to the model) increases the size of the state space, and the computing power required to represent and update it, exponentially (1171).” Apparently problems arise even when there are only “a handful of dimensions of state spaces” (1175). Therefore, in many cases, it seems that behavior is better described by “semi-Bayesian models” (viz. with truncated state spaces) or “non-Bayesian models” (viz. in which some parameters are updated and some ignored) (1175). Or by models in which “variance-blind heuristics” substitute for precision weighted integration, or in which, “rather than optimizing learning by integrating information over several trials, participants seem to use only one previous exemplar of each category to determine its ‘mean’ (1175).”
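The exponential blow-up OJB describe is easy to make concrete. Assuming, purely for illustration, that each model parameter is discretized into 100 grid points (my toy numbers, not the paper’s), the joint state space multiplies with each added dimension:

```python
# Toy illustration: a fully Bayesian model that discretizes each
# parameter into 100 grid points must represent 100**d joint states,
# since the full joint distribution covers every combination of values.
bins_per_parameter = 100
for d in range(1, 6):
    states = bins_per_parameter ** d
    print(f"{d} parameter(s): {states:,} joint states to represent and update")
```

By five parameters the model is tracking ten billion joint states, which is why even "a handful of dimensions" becomes unmanageable for a fully Bayesian updater.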

OJB describe various other scenarios of interest all with the same aim: to show how to take Bayes seriously as a substantive piece of cog-neuro science. It is precisely because not everything is Bayesian that showing a mechanism to be Bayesian can yield some explanatory insight from the classification. OJB take Bayes to be a useful description for some class of mechanisms, ones with the three basic characteristics noted above: PDF representations, precision weighted integration and a fully jointly specified state space.

OJB points out one further positive feature of taking Bayes in this way: you can start looking for neural mechanisms that execute these functions, e.g. how neurons or populations of neurons might allow for PDF like representations and their integrations. In other words, a mechanistic interpretation of Bayes leads to an understandable research program, one with real potential empirical reach.

Let me end here. I understand the OJB version of Bayes. I am not sure how much it describes linguistic phenomena, but that is an empirical question, and not one that I will be well placed to adjudicate. However, it is understandable. What is less understandable are versions that do not treat Bayes as a hypothesis about mental/neural mechanisms. If Bayes is not this, then why should we care? Indeed, what possible reason can there be for not taking Bayes in this mechanistic way? Is it the fear that it might be wrong so construed? Is being wrong so bad? Isn’t it the aim of cog-neuro to develop and examine theories that could be wrong? So, my question to non-mechanistic Bayesians: what’s the value added?

[1] This sounds more dismissive than it should perhaps. The invention of double entry bookkeeping was a real big deal and if Bayes serves a similar function, then it is nothing to sneeze at. However, if this is what practitioners take its main contribution to be, they should let us know in so many words.
[2] I believe that one can see an interesting version of these contrasting views in the give and take between Alex C and John Pate in the comments section here.
[3] The paper (How can a Bayesian approach inform neuroscience) is behind a paywall. Sorry. If you are affiliated with a university you should be able to get to it pretty easily as it is a Wiley publication. 


  1. This comment has been removed by the author.

    1. One huge attraction for the "methodological Bayes" stance is that it provides a principled way of studying how certain assumptions interact with certain inputs. As such, Bayesian modeling is an obvious tool to address questions of what can and cannot, in principle, be acquired from data, by a learner that embodies certain well-specified inductive biases. I don't know whether this is anything like double entry bookkeeping, but it certainly isn't something to sneeze at.
I'm happy to agree that there is not as much conceptual clarity as one would hope for. But that's hardly a feature that distinguishes Bayesian computational modelling from theoretical syntax.
      [corrected typo]

  2. I think I agreed in note 1. But if this is the appeal we should be told and it's interesting to see there is another point of view. Last, if this is the right conception we should dump all the talk of Marr. Agreed?

    1. I think the methodological Bayes stance fits very nicely into Marr's levels. A Bayesian model specifies how a learner analyzes structures into parts. For example, the likelihood function for a dependency tree may compute the probability of the tree in terms of the probabilities of individual arcs between parts of speech, or it may compute the probability of the tree in terms of larger subtrees, or arcs between words and parts of speech, and so on. This is the computational level analysis, and says something like: if this is what our reusable pieces look like (e.g. tree substitution grammar elementary trees), and this is how much we prefer each piece before seeing data (we prefer smaller elementary trees), and this is our dataset that gives us the values of some of those variables (e.g. all the words and sentence boundaries), then here is the probability, for each location, that each piece was used. There is a diverse range of algorithms for actually computing these probabilities, or even just using them (in practice, as Ben keeps reminding me, researchers usually just want to find the most likely composition of pieces, and don't care what the actual probabilities are).

      Given a model specification, there are many different algorithms for performing inference in that model. Some of those algorithms make “soft” choices, combining partial guesses about unobserved structure. Other algorithms will make hard choices, committing totally to a structure, but making these hard choices in proportion to the posterior probability over the long run. The choice of algorithm is the algorithmic level analysis. When OJB say that “it is a central characteristic of fully Bayesian models that they represent the full state space,” they must mean that the abstract (computational-level) model represents the full state space, not that the algorithm for performing inference in the model represents the full state space. The whole point of sampling and variational algorithms is that the full state space is too big to represent. Indeed, non-parametric models have an infinite state space that cannot be fully represented in finite memory.
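The point about sampling algorithms can be sketched with a minimal Metropolis sampler (a hypothetical illustration of mine, not from the comment thread): the algorithm approximates a posterior by visiting one state at a time, in proportion to its probability over the long run, without ever holding the full joint distribution in memory.

```python
import random

# Minimal Metropolis sampler over a toy unnormalized posterior on the
# integers 0..10, peaked at x = 5. The chain stores only its current
# state, yet its visit frequencies approximate the full distribution.

def unnormalized_posterior(x):
    return max(0.0, 1.0 - abs(x - 5) / 6.0) if 0 <= x <= 10 else 0.0

def metropolis(n_samples, seed=0):
    rng = random.Random(seed)
    x = 0  # arbitrary starting state
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.choice([-1, 1])
        ratio = unnormalized_posterior(proposal) / unnormalized_posterior(x)
        if rng.random() < ratio:  # accept in proportion to posterior ratio
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(20000)
print(sum(samples) / len(samples))  # should land near 5, the posterior mean
```

At no point does the sampler enumerate the eleven states, let alone a high-dimensional joint grid; that is the algorithmic-level trick the comment alludes to.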