Tuesday, November 12, 2013

On playing nicely together

UMD has a cognitive science lecture series with colloquia delivered every other Thursday afternoon to a pretty diverse audience of philosophers, linguists, psychologists, and the occasional computer scientist, neuroscientist and mathematician.  The papers are presented by leading lights (i.e. those with reputations). The last dignitary to speak to us was Fei Xu from Berkeley psych and her talk was on how a rational constructivism, a new take on old debates, will allow us to get beyond the Empiricism/Rationalism (E/R) dualism of yore. Here’s the abstract:

The study of cognitive development has often been framed in terms of the nativist/empiricist debate.  Here I present a new approach to cognitive development – rational constructivism. I will argue that learners take into account both prior knowledge and biases (learned or unlearned) as well as statistical information in the input; prior knowledge and statistical information are combined in a rational manner (as is often captured in Bayesian models of cognition).  Furthermore, there may be a set of domain-general learning mechanisms that give rise to domain-specific knowledge.  I will present evidence supporting the idea that early learning is rational, statistical, and inferential, and infants and young children are rational, constructivist learners.

I had to leave about five minutes before the end of the presentation and missed the question period. However, the talk did get me thinking about the issues Xu mentioned in her abstract. I cannot say that her remarks found fertile ground in my imagination, but they did kick start a train of thought about whether the debate between Es and Rs is worth resolving (or whether we should instead preserve and sharpen the points of disagreement) and what it would mean to resolve it. Here’s what I’ve been thinking.

When I started in this business in the 1970s, the E/R debate was billed as the “innateness controversy.”  The idea was that Rs believe that there are innate mental structures whereas Es don’t. Of course, this is silly (as was quickly observed and acknowledged), for completely unstructured minds cannot do anything, let alone think. Or, more correctly, both Es and Rs recognize that minds generalize, the problem being to specify the nature of these generalizations and the mechanisms that guide them (as mush doesn’t do much generalizing). If so, the E/R question is not whether minds have biologically provided structure that supports generalization but the nature of this structure.

This question, in turn, resolves itself into two further questions: (i) the dimensions along which mental generalizations run (i.e. the primitive features minds come with) and (ii) the procedures that specify how inputs interact with these features to fix our actual concepts. Taking these two questions as basic allows us to recast the E/R debate along the following two dimensions: (a) how “specific” are the innately specified features and (b) how “rational” are the acquisition procedures. Let me address these two issues in turn.

Given that features are needed, what constitutes an admissible one?  Es have been more than happy to acknowledge sensory/perceptual features (spf), but have often been reluctant to admit much else. In particular, Es have been averse to cognitive modularity as this would invite admitting domain specific innate features of cognitive computation. Spfs are fine. Maybe domain general features are ok (e.g. “edge,” “animate,” etc.). But domain specific features like “island” or “binder” are not.

A necessary concomitant of restricting the mental feature inventory in this Eish way is a method for building complexes of features from the given simples (i.e. a combinatorics). Es generally recognize that mental contents go beyond (or appear to go beyond) the descriptive resources of the primitives.  The answer: what’s not primitive is a construct from these primitives. Thus, in this sense, constructivism (in some form) is a necessary part of any Eish account of mind.

A second part of any Eish theory is an account of how the innately given features interact with sensory input to give rise to a cognitive output. The standard assumption has been that this method of combination is rational.  ‘Rational’ here means that input is evaluated in a roughly “scientific” manner, albeit unconsciously, viz. (i) the data is carefully sifted, organized and counted and (ii) all cognitive alternatives (more or less) are evaluated with respect to how well they fit with this regimented data. The best alternative (the one that best fits the data) wins. In other words, on this conception, acquisition/development is an inductive procedure not unlike what we find more overtly in scientific practice[1] (albeit tacit) and it rests on an inductive logic with the principles of probability and statistics forming the backbone of the procedure.[2]

Many currently fashionable Bayesian models (BM) embody this E vision. See, for example, the Xu abstract above. BMs assume a given hypothesis space, possibly articulated, i.e. seeded with given “prior knowledge and biases,”[3] plus Bayes Rule and maybe “a set of domain-general learning mechanisms.” These combine “statistical information” culled from the environmental input together with the “prior knowledge and biases” in a “rational manner” (viz. a Bayesian manner) to derive “domain specific knowledge” from “domain-general learning mechanisms.”
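The schematic picture just described (a given hypothesis space, a prior over it, and input combined with that prior via Bayes Rule) can be sketched in a few lines. The hypotheses and numbers below are illustrative assumptions of mine, not anything from Xu's work:

```python
# A minimal sketch of the Bayesian picture: a fixed hypothesis space,
# a prior ("prior knowledge and biases"), and Bayes Rule combining the
# prior with the statistical information in the input, one datum at a time.

def bayes_update(prior, likelihood, datum):
    """Return the posterior over hypotheses after observing one datum."""
    unnorm = {h: prior[h] * likelihood[h](datum) for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Two toy hypotheses about a coin-like source (illustrative only).
likelihood = {
    "H-biased": lambda d: 0.9 if d == "heads" else 0.1,
    "H-fair":   lambda d: 0.5,
}
posterior = {"H-biased": 0.5, "H-fair": 0.5}   # flat prior
for datum in ["heads", "heads", "tails", "heads"]:
    posterior = bayes_update(posterior, likelihood, datum)
```

After mostly-heads input the posterior favors the biased hypothesis; the "rational" part is just that the prior and the fit of each hypothesis to the regimented data jointly determine the output.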

There are several ways to argue against this E conception and Rs have deployed them all. 

First, one can challenge the assumption that features be restricted to spfs or the domain general ones. Linguists of the generative stripe should all be familiar with this kind of argument. Virtually all generative theories tell us that linguistic competence (i.e. knowledge of language) is replete with very domain specific notions (e.g. specified subject, tensed clause, c-commanding antecedent (aka binder), island, PRO, etc.) without which a speaker’s attested linguistic capacity cannot be adequately described.

Second, one might argue against the idea that there is a lot of feature combinatorics going on. Recall that if one restricts oneself to a small set of domain general features then one will need to analyze concepts that (apparently) fall outside this domain as combinations of these given features. Jerry Fodor’s argument for the innateness of virtually all of our lexical concepts is an instance of this kind of argument. It has two parts. First, it observes that if concept acquisition is statistical (see below) then the hypothesis space must have some version of the acquired concept as a value in that space (i.e. it relies on the following truism: if acquiring concept C involves statistically tracking Cish patterns then the tracker (i.e. a mind) must come pre-specified with Cish possibilities). Second, it argues that there is no way of defining (most of) our lexical concepts from a small set of lexical primitives. Thus, the Cish possibilities must be coded in the hypothesis space as such, and not as congeries of non-Cish primitives in combination. Put the two assumptions together and one gets that ‘carburetor’ must be an innate concept (specified as such in the hypothesis space).[4] This form of argument can be deployed anywhere and where it succeeds it serves to challenge the important Eish conception of a relatively small/sparse sensory/perceptual and/or domain general set of primitive features underlying our cognitive capacities.

Third, one can challenge the idea that acquisition/development is “rational.”[5] [6] This is perhaps the most ambitious anti-E argument, for it targets an assumption long shared by Rs as well.[7] One can see the roots of this kind of criticism in the distinction between triggering stimuli and formative stimuli. Rs have long argued that the relation between environmental input and cognitive output is less a matter of induction than a form of triggering (more akin to transduction than induction). Understanding ‘trigger’ as ‘hair trigger’ provides a way of challenging the idea that acquisition is a matter of induction at all. A paradigm example of triggering is one trial learning (OTL).[8]

To the degree that OTL exists, it argues against the idea that acquisition is “rational” in any reasonable sense. Minds do not carefully organize and smooth the incoming data and minds do not incrementally evaluate all possible hypotheses against the data so organized in a deliberate manner. If OTL is the predominant way that basic cognitive competence arises, it would argue for re-conceptualizing acquisition as “growth” rather than “learning,” as Chomsky has often suggested. Growth is no less responsive to environmental inputs than learning is, but the responsiveness is not “rational,” just brute causal. It is an empirical question, albeit a very subtle one, whether acquisition/development is best modeled as a rational or a brute causal process. IMO, we (indeed I!) have been too quick to assume that induction (aka: learning) is really the only game in town.[9]

Let me end: there is a tendency in academic/intellectual life to split the difference between opposing views and to find compromise positions where we conclude that both sides were right to some degree. You can see this sentiment at work in the Xu abstract above. In the interest of honesty, I confess to having been seduced by this sort of gentle compromise myself (though many of you might find this hard to believe). This live and let live policy enhances collegiality and makes everyone feel that their work is valued and hence, valuable (in virtue of being at least somewhat right). I think that this is a mistake. This attitude serves to blur valuable conceptual distinctions, ones that have far reaching intellectual implications. Rather than bleaching them of difference, we should sharpen the opposing E/R conceptions precisely so that we can better use them to investigate mental phenomena. Though there is nothing wrong with being wrong (and work that is deeply wrong can be very valuable), there is a lot wrong with being namby-pamby. The E/R opposition presents two very different conceptions of how minds/brains work. Maybe the right story will involve taking a little from column E and a little from column R. But right now, I think that sharpening the E/R distinctions and investigating the pure cases is far more productive. At the very least it serves to flush out empirically substantive assumptions that are presupposed rather than asserted and defended. So, from now on, no more Mr. Nice Guy! [10]

[1] Actually, whether we find this in scientific practice is a topic of pretty extensive debate.
[2] There is a decision procedure required as well (e.g. maximize expected utility), but I leave this aside.
[3] Let me repeat something that I have said before: there is nothing in Bayes per se that eschews pretty abstract and domain specific information in the hypothesis space, though as a matter of fact such has been avoided (in this respect, it’s a repeat of the old connectionism discussions). One can be an R-Bayesian, though this does not seem to be a favored combo. Such a creature would have Rish features in the hypothesis space and/or domain specific weightings of features. So as far as features go, Bayesians need not be Es. However, there may still be an R objection to framing the questions Bayes-wise, as I discuss below.
[4] I discuss Jerry’s argument more fully here and here. Note that it is important to remember that this is an argument about the innately given hypothesis space, not about belief fixation. Indeed, Jerry’s original argument assumed that belief fixation was inductive. So, the concept CARBURETOR may be innate but fixing the lexical tag ‘carburetor’ onto the concept CARBURETOR was taken to be inductive.
[5] Fodor’s original argument did not do this. However, he did consider this in his later work, especially in LOT2.
[6] I am restricting discussion here to acquisition/development. None of what I say below need extend to how acquired knowledge is deployed on line in real time, in, e.g. parsing. Thus, for example, it is possible that humans are Bayesian parsers without their being Bayesian learners. Of course, I am not saying that they are, just that this is a possible position.
[7] E.g. the old concept of children as “little linguists” falls into this fold.
[8] See here for some discussion and further links.
[9] There are other ways of arguing against the rationality assumption. Thus, one can argue that full rationality is impossible to achieve as it is computationally unattainable (viz. the relevant computation is intractable). This is a standard critique of Bayesian models, for example. The fall back position is some version of “bounded” rationality.  Of course, the tighter the bounds the less “rational” the process.  Indeed, in discussions of bounded rationality all the interesting action comes in specifying the bounds, for if these are very narrow, the explanatory load shifts from the rational procedures to the non-rational bounds. Economists are currently fighting this out under the rubric of “rational expectations.” One can try to render the irrational rational by going Darwinian; the bounds themselves “make sense” in a larger optimizing context. Here too brute causal forces (e.g. Evo-Devo, natural physical constraints) can be opposed to maximizing selection procedures. Suffice it to say, the E/R debate runs deep and wide. Good! That’s what makes it interesting.
[10] For the record, lots of the thoughts outlined above have been prompted by discussions with Paul Pietroski. I am not sure if he wishes to be associated with what I have written here. I suspect that my discussion may seem to him too compromising and mild.


  1. "To the degree that OTL exists, it argues against the idea that acquisition is “rational” in any reasonable sense."

    I guess the classic example in language acquisition of OTL is fast mapping -- and that has Bayesian explanations like, say, Fei Xu's own work, that are both "rational" and "probabilistic". I am having trouble seeing how word learning could be other than rational.

  2. "Here [in evolution] too brute causal forces (e.g. Evo-Devo, natural physical constraints) can be opposed to maximizing selection procedures."

    You'll need to clarify (especially for this evolutionary naif) what the role of evo devo is in this context. But "natural physical constraints": isn't the whole point that selective forces do and indeed must select organisms with traits that are "better" in some sense, _up to_ physical limitations? You might have to forgive some naif's coarseness there but I couldn't be too too far off, mm? And isn't the point of the rational cognition program to explain cognitive processes as "as close to optimal as we can get under the circumstances"? Either on an organism level or an evolutionary level, different cases may be different (bug me for examples). In other words the "brute causal forces" you refer to must warrant some explanation for being one way and not another. It seems to me the only useful dichotomy is the phenotype-genotype level dichotomy, in one way or another things must be the way they are for a reason. Unless you reject this, I claim that you are merely elaborating the details of the rational cognition program.

  3. Ewan and Alex rightly press me on what I mean by rational cognition. Here's a shot: I take the program to be part of the continuation of a long tradition that sees induction as the key to understanding cognition. Hume is the poster pinup here. The aim of theories of induction was to model how scientific deliberation worked: how data collection and evaluation leads to truth. The idea was (and is) that how data is sampled, how it is organized and how it is used to evaluate alternatives plays a major part in explaining why some investigations lead to truth and some don't. In this context, for example, induction is contrasted with abduction, learning is contrasted with growth, and gradual learning with one trial learning. Abduction, one trial learning, and growth are NOT rational processes in the normal sense of the term. So, one way of taking my point is that I want to know how the Bayesian Rational Cognition program fits with these ideas. One answer is that it is at right angles to it. Another is that it abstracts away from it. Another is that it embodies it. I have been assuming that it sees itself as part of this tradition (hearing Xu, e.g., start with a contrast of the Rationalist and Empiricist traditions and claiming that we can eat our two cakes etc. suggested that part of the sell of this approach is the traditional one). However, I could be wrong. Others treat this "approach" as effectively a suggestion for a normal form notation without making ANY empirical claims. Is this indeed the whole point? Let's talk Bayes because it's neutral wrt any questions we will be interested in? Or is it that Bayes has content and it is the right tool for the job of explaining development/acquisition? In which case I want to know what it is about Bayes that makes it such. What does IT bring to the table empirically? So when I take it that induction is not the right way of thinking about a problem, I mean it as seen against the great tradition. It is not merely a technical question.

    Let me make this point another way: is 'rational' in 'rational cognition' like 'significant' in 'statistically significant'? As we all know statistically significant results can be trivial.

    1. You are trying to squish together too many different distinctions that are mostly independent.
      I understand (mostly) the brute causal/rational distinction.
      But that seems completely different from the gradual/OTL distinction.
      Growth for example is normally slow -- so if the analogy for language acquisition is something like pubescence or the growth of a kidney that is a gradual process.

    2. Yes, they are different dimensions along which E/R approaches contrast. The main interest in OTL for Rs is that it suggests that classical "learning" is not the real issue: one trial learning suggests a great deal of mental baggage. And yes this is different from the first distinction, though not less interesting.

  4. I think the relevance of Bayesian methods comes down to whether you think there is uncertainty in the process of language acquisition. Advocates of "deterministic" triggering-based algorithms, which never make a learning mistake (and so are really "inerrant" rather than simply "deterministic") are really proposing that there is no uncertainty in the acquisition process: if we know ahead of time certain relationships between words and categories, the grammar can be straightforwardly decoded from the input.

    However, if we accept that there is uncertainty involved in some aspect of language acquisition, then probabilistic models are useful, because they measure uncertainty about the values of hypothesized variables (and Bayesian models are just probabilistic models that represent uncertainty about the model parameters). For example, if we think that children use an inerrant triggering algorithm for learning syntax once they are confident in the part of speech for each word, but that there is uncertainty about the mapping between parts of speech and words, we could use a Bayesian model to measure when there is good evidence for particular sequences of parts of speech.

    1. I like this way of putting matters. Would you agree that the issue is just how inerrant it is? In other words, I take OTL to be a limit case that IF correct suggests that inductive methods are really not where the action is. However, the way I would like to think about it is that the narrower the range of alternatives the less work there is for inductive methods to do.

    2. @ John: You say: "if we think that children use an inerrant triggering algorithm for learning syntax once they are confident in the part of speech for each word," This sounds fascinating. Do we have any idea how such a mechanism might have evolved?

    3. @Norbert: I'm deliberately avoiding making proposals about the actual learning procedure children use. If they are faced with uncertainty, then probabilistic models can measure what counts as good evidence when, under explicit assumptions about what linguistic structures look like. In other words, probabilistic models can tell us about the shape of the data under different assumptions about linguistic structures, without committing to any one inference procedure on the part of the child. Indeed, the field of machine learning has shown that algorithms with very different behavioral properties are capable of approximating inference in the same underlying model.

      If OTL occurs in the face of uncertainty, then probabilistic models can allow us to measure the utility of different cues the learning strategy, whatever it is, might be attending to. Those same probabilistic methods may also suggest inference procedures that are capable of sudden changes of state, but the effectiveness of a probabilistic model for measuring uncertainty is different from its effectiveness for replicating the sequence of steps a child follows. Personally, I expect that language acquisition is full of uncertainty, and so the strategies children follow should have probabilistic justifications, but they don't necessarily need to be the kind of clean and natural relaxations of optimal procedures that come up in machine learning.

      @Christina Behme: I don't know how such a mechanism would have evolved. I had in mind the Sakas and Fodor (2012) paper.

      Maybe I should mention that I'm not advocating for this proposal myself. I merely brought it up to illustrate how probabilistic models are useful for measuring uncertainty in hypothesized variables, regardless of the broader theoretical framework, since measuring uncertainty is just what probabilistic models do.

    4. Let's consider a concrete example. Trigger learning algorithms proceed by maintaining a single hypothesis about the grammar, keeping that hypothesis if it is capable of parsing observed sequences, and moving to a different hypothesis if the grammar cannot handle some observed sequence. This algorithm in broad outline is essentially the “win-stay/lose-shift” algorithm (WSLS). WSLS approximates Bayesian inference when the likelihood function is deterministic, and can be relaxed to non-deterministic likelihood functions. So the sudden changes that occur with trigger-learning algorithms are possible with algorithms that perform inference in a Bayesian model.

      One major difference between trigger learning algorithms and WSLS, or other particle filter-based algorithms, is that trigger learning algorithms typically assume that a parameter cannot be changed once it has been set. Technically, this means that the Markov chains of such trigger learning algorithms are not ergodic, and they can get “stuck” in bad hypotheses. This is why the “subset” problem arises for trigger learning algorithms but not for most statistical learners: a hypothesis that generates a language that is too large will pay a statistical price. The Sakas and Fodor (2012) paper I mentioned in my previous comment addresses the non-ergodicity by arguing that there is never uncertainty about the goodness of some transitions in the Markov chain. If children in fact follow a strategy like the one they outline, then the procedural similarities between trigger learning algorithms and particle filters are just superficial coincidences: the success of the child is not due to proper handling of uncertainty because there was never any uncertainty.
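The win-stay/lose-shift dynamic described above can be sketched as follows. The toy grammars (sets of licensed strings) are illustrative assumptions of mine, and the shift step here is blind, jumping to a random alternative rather than to an informed one:

```python
import random

# Toy "grammars": each is just the set of strings it licenses (illustrative).
GRAMMARS = {
    "G1": {"ab", "aabb"},
    "G2": {"ab", "ba"},
    "G3": {"ab", "aabb", "aaabbb"},
}

def wsls(data, seed=0):
    """Win-stay/lose-shift over the grammar hypotheses.

    Keep the current grammar while it parses the input ("win-stay");
    on a parse failure, jump to a randomly chosen alternative ("lose-shift").
    """
    rng = random.Random(seed)
    current = rng.choice(sorted(GRAMMARS))
    for s in data:
        if s not in GRAMMARS[current]:  # "lose": current grammar fails on s
            current = rng.choice([g for g in sorted(GRAMMARS) if g != current])
    return current

# Only G3 licenses every string below; once the walk lands on G3 it stays put.
final = wsls(["ab", "aabb", "aaabbb"] * 10)
```

With a deterministic likelihood (a grammar either parses a string or it does not), the "lose" step is triggered exactly when the current hypothesis's likelihood hits zero, which is the sense in which WSLS approximates Bayesian inference; a variant in which a shifted parameter can never be revisited is the non-ergodic case discussed above.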

    5. Hi John,

      Take an even simpler learner:
      Say you have a general Bayesian classifier, two classes A and B,
      and maybe in this particular case the supports of the distributions p(x|A) and p(x|B) are disjoint -- i.e. at least one of the two is always zero. Then when you see one point with, say, p(x|B) = 0, you know that p(A|x) = 1.
      I.e. this is OTL.
      So I would say this is a rational Bayesian learner which does OTL. Are you arguing that this is not Bayesian because there is no uncertainty?

      What would Norbert say for this example?
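The setup in this comment is easy to make concrete. In the sketch below the two likelihoods have disjoint supports (negative vs. non-negative x, an illustrative assumption of mine), the prior is flat, and a single observation drives the posterior to 0 or 1, i.e. one-trial learning as a limit case of Bayesian updating:

```python
def posterior_A(x, prior_A=0.5):
    """Posterior p(A|x) for two classes whose likelihoods have disjoint support."""
    p_x_given_A = 1.0 if x < 0 else 0.0   # p(x|A): support on the negatives
    p_x_given_B = 0.0 if x < 0 else 1.0   # p(x|B): support on the non-negatives
    num = prior_A * p_x_given_A
    return num / (num + (1.0 - prior_A) * p_x_given_B)

# One data point that B assigns zero probability settles the matter at once.
print(posterior_A(-3.0))  # 1.0
print(posterior_A(2.0))   # 0.0
```

Note that the flat prior plays no real role here: any non-extreme prior yields the same one-shot jump, since the zero in the likelihood does all the work.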

    6. No, I'm happy to say that that learner is Bayesian, but Bayesian learning in this situation is probably overkill. You could explain the same pattern by appealing only to set membership, and Bayesian models also appeal to sets. The extra moving parts of Bayesian models are justified when you want to address uncertainty.

    7. In some cases there certainly is a certain degree of uncertainty -- of that we can be ... quite sure (as Blackadder puts it) -- so we need a mechanism to deal with that, say some general Bayesian reasoner. The cases of certainty and OTL are just special cases where the posterior distribution is 0 or 1, and so we don't need a separate mechanism, as the Bayesian reasoner will work just fine (overkill as you put it); so the more parsimonious explanation is that certainty is just a special case of uncertainty.

    8. Yes, it is possible that children have a general purpose Bayesian sampler that they can plug in to any inference problem, and that it's just easier to use the general purpose Bayesian sampler even for deterministic problems than to devise a special-purpose theorem prover. However, biology is messy, and I'm not sure we should assume that the brain follows software engineering best practices. My point was just that, regardless of how the inference toolkit is laid out, if it works by handling uncertainty, Bayesian methods are useful for measuring uncertainty under different ways of analyzing linguistic structures into parts.

    9. @ John: You say:

      "However, biology is messy, and I'm not sure we should assume that the brain follows software engineering best practices."

      True as this may be, I do not see how it addresses Alex's point that having 1 mechanism is simpler than having 2. Besides being 'messy', brains also have to 'economize' [they consume the most 'fuel' of all biological organs as it is] - so why maintain one [metabolically costly] mechanism that rarely gets used [even if such should have evolved - I have no clue how this could have happened in the first place but that's another issue] when one can do almost as well with only the other?

      To repeat a point I made earlier on this blog: unless you actually do biology [e.g. brain-research] and have some "hard" evidence for specialized mechanisms, it seems idle to speculate. If people like Alex can show that a Bayesian sampler CAN do the job there seems no a priori reason to rule out that brains can also make do with one mechanism for many cognitive tasks. Whether they actually do is an entirely different question and I have not given up hope entirely that one day Norbert will convince one of the biologists of the biolinguistic enterprise to explain to us how UG is implemented in the brain...

    10. "If people like Alex can show that a Bayesian sampler CAN do the job there seems no a priori reason to rule out that brains can also make do with one mechanism for many cognitive tasks."

      No pun intended, but that depends on your prior.

    11. I am mostly interested in the question of whether there are any arguments that the existence of OTL implies that empiricist or rational or probabilistic learning is wrong. But it seems that there aren't.

    12. @ Alex Clark: If the suggestion is that "certainty and OTL are just special cases where the posterior distribution is 0 or 1", then effectively one is saying the priors need to be 1/0, right? If so, aren't you committing to some non-empiricist knowledge (priors = 1/0). Note, the OTL data suggests that kids are also able to revise their hypotheses that they learn through OTL; therefore, modelling OTL through 1/0 priors wouldn't allow for this. In fact, there would be no "learning" left.

    13. No, I was thinking of two hypotheses with a flat prior -- 0.5 each -- whose likelihood functions (prob of data given hypothesis) have disjoint support.

    14. @Benjamin: Thank you so much for this ingenious pun. Since Alex solved the prior problem you raised prior to me having a chance to respond, maybe you could be so kind and educate us on biological implementation?

    15. I don't see how what Alex wrote has any bearing on whether or not there are good "a priori" reasons for or against preferring single domain-general mechanism explanations over alternatives.

      As for biological implementation, biological implementation of what? The intuiting faculty that allows us to stand in a knowledge-of-relation with abstract entities such as "languages"? The (one and only?) domain general learning mechanism that is the answer to all of our problems? Or of a language faculty of the kind you seem to have serious misgivings about?
      I'm happy to admit that I can't give, nor do I know of, any detailed account of how any of these are biologically implemented. But as far as I can see, we are all in the same boat here, "empiricists" or "rationalists". So what?

    16. May I ask how you KNOW that 'we're all in the same boat'? Are you intimately familiar with recent work in neurophysiology/developmental psychology? That you even have to ask 'biological implementation of what?' suggests otherwise - someone taking the 'bio' in biolinguistics seriously would not ask such a question. He would talk about what he [or his colleagues] has [have] discovered [little as this might be] and compare it to what those working in other frameworks have discovered [little as that might be]. If, as you say, everyone would be equally ignorant, why would your ignorance be any better than the ignorance of a person you deride as 'empiricist'?

    17. I really fail to see where I was describing certain kinds of ignorance as better than others, deriding anyone as 'empiricist' or even claiming to KNOW that we're all in the same boat.

      I'd be delighted to be pointed to any recent work in neurophysiology that tackles the most fundamental implementation problem for any cognitive theory, i.e. how any kind of structured representations could be represented in the brain.

    18. This is out of my area of expertise but there is a lot of work on population coding of various types of representations, especially in the visual cortex (for obvious methodological reasons).
      I don't follow this work but put "population coding" into google scholar and
      you will get several hundred recent papers.
      (or there are some videos here.)

      It is an interesting question about whether one can join this sort of work up with the concerns that we have on this blog. I am unconvinced at the moment that it has much relevance as the gap seems too large; but I am entirely open-minded about this.

      Maybe someone that knows this literature better could point us to some more pertinent recent work that bears on this.

    19. Thanks for this. From a quick glance this reminded me of Paul Smolensky's work on coding tree-representations in artificial neural networks. I share what I take to be your scepticism, however, as to whether any of the current work really would allow us to cash in our concepts in (artificial) neuro-vocabulary, much less "biological" notions. In any case, I fail to see why that would be problematic, and how this ought to be different for (Chomskyan) Generative Linguistics than for Cognitive Science in general.

    20. Yes agreed. I think Christina's objection is to what she perceives as rhetorical overreach by biolinguists; and while that is a perfectly reasonable point, it's not really relevant to the current issue.

  5. There are two cases where people say we should move beyond the difference between X and Y. In one, X versus Y is a false dichotomy. In the other, the question of X versus Y is simply ill-posed. This is a case of the second sort. It is often useful to go to great lengths to try to precisify their intuitions and those of others, but, frankly, not when there are whole mountains of precise theory that make genuinely meaningful distinctions about learning...