Thursday, April 23, 2015

Where the estimable Mark Johnson corrects Norbert (sort of)

For various reasons, Mark J could not post this as a comment on this post. As he knows much more than I do about these matters, I thought it a public service to lift these remarks from the comments section to make them more visible. I think that this is worth reading in conjunction with Charles' recent post (here). At any rate, this is interesting stuff and I don't disagree much (or have not found reasons to disagree much) with what Mark J says below. I will, of course, allow myself some comments later. Thx Mark.


This was originally written as a comment for the "Bayes and Gigerenzer" post, but a combination of a length restriction on comments and my university's not enabling blog posts from our accounts meant I had to email Norbert directly.

As Norbert has remarked, Bayesian approaches are often conflated with strong empiricist approaches, and I think this post does that too.  But even within a Bayesian approach, there are powerful reasons not to be a "tabula rasa" empiricist.  The "bias-variance dilemma" is a mathematical statement of something I've seen Norbert say in this blog: learning only works when the hypothesis space is constrained.  In mathematical terms, you can characterise a learning problem in terms of its bias -- the range of hypotheses being considered -- and the variance or uncertainty with which you can identify the correct hypothesis.  There's a mathematical theorem that says that in general as the bias goes down (i.e., the class of hypotheses increases) the variance increases.
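The bias-variance tradeoff is easy to see in a small simulation. The sketch below is illustrative only (the polynomial degrees, noise level, and sample sizes are my own assumptions, not anything from the discussion above): it fits polynomials of increasing degree to many resampled noisy datasets and measures how much the fitted prediction at a fixed point varies across datasets. As the hypothesis class grows (i.e., as the bias goes down), that variance goes up.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def prediction_variance(degree, n_datasets=200, n_points=20, x_test=0.5):
    """Variance across datasets of a degree-`degree` polynomial fit's prediction at x_test."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0.0, 1.0, n_points)
        y = true_fn(x) + rng.normal(0.0, 0.3, n_points)
        coefs = np.polyfit(x, y, degree)        # fit one hypothesis from this dataset
        preds.append(np.polyval(coefs, x_test))
    return float(np.var(preds))

# Bigger hypothesis class (higher degree, lower bias) -> higher variance.
for d in (1, 3, 9):
    print(d, prediction_variance(d))
```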

Given this, I think a very reasonable approach is to formulate a model that includes as much relevant information from universal grammar as we can put into it, and perform inference that is as close to optimal as we can achieve from data that is as close as possible to what the child receives.  I think this ought to be every generative linguist's baseline model of acquisition!  Even with an incomplete model and incomplete data, we can obtain results of the kind "innate knowledge X plus data Y can yield knowledge Z".

But for some strange (I suspect largely historical) reason, this is not how Chomskyan linguists think of computational models of language acquisition. Instead, they prefer ad hoc procedural models. Everyone agrees there has to be some kind of algorithm which children use to learn language. I know there are lots of psychologists who are sure they have a good idea of the kinds of things kids can and can't do, but I suspect nobody really has the faintest idea of what algorithms are "cognitively plausible". We have little idea of how neural circuitry computes, especially over the kinds of hierarchical representations we know are involved in language. Algorithms which can be coded up as short computer programs (which is what most people have in mind when they say simple) might turn out to be neurally complex, while we know that the massive number of computational elements in the brain enables it to solve computationally very complex problems. In vision -- a domain we can sort-of study because we can stick electrodes into animals' brains -- it turns out that the image processing algorithms implemented in brains are actually very sophisticated and close to Bayes-optimal, backed up with an incredible amount of processing power. Why not start with the default assumption that the same is true of language?

It's true that in word segmentation, a simple ad hoc procedure -- a simple greedy learning algorithm that ignores all interword dependencies -- actually does a "reasonable job" of segmentation, and that either improving the algorithm's search procedure or making it track more complex dependencies actually decreases the overall word segmentation accuracy. Here I think Ben Borschinger's comment has it right: sometimes a suboptimal algorithm can correct for the errors of a deficient model if the errors of each go in opposite directions. We've known since at least Goldwater's work that an inaccurate "unigram" model that ignores inter-word interactions will prefer to find multi-word collocations and hence undersegment. On the other hand, a naive greedy search procedure tends to over-segment, i.e., find word boundaries where there are none. Because the unigram model under-segments while a naive greedy algorithm over-segments, the combination actually does better than approaches that improve only the search procedure or only the model (by incorporating inter-word dependencies), since those are left with "uncancelled errors".
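The over-segmentation tendency of a naive greedy strategy can be shown with a toy example. To be clear, this is a caricature of my own making, not Goldwater's model or any algorithm from the literature: a segmenter that always grabs the most frequent known word at each position will happily shatter a long, rare word into frequent short ones.

```python
# Hypothetical word frequencies for a handful of "known" words.
freq = {"to": 100, "get": 80, "her": 90, "together": 5}

def greedy_segment(s):
    """Greedily take the most frequent known word matching at each position."""
    out, i = [], 0
    while i < len(s):
        candidates = [w for w in freq if s.startswith(w, i)]
        if not candidates:          # no known word fits this span: give up
            return None
        best = max(candidates, key=lambda w: freq[w])
        out.append(best)
        i += len(best)
    return out

print(greedy_segment("together"))   # over-segments: ['to', 'get', 'her']
```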

Of course it's logically possible that children use ad hoc learning procedures that rely on this kind of "happy coincidence", but I think it's unlikely for several reasons.

First, these procedures are ad hoc -- there is no theory, no principled reason why they should work.  Their main claim to fame is that they are simple, but there are lots of other "simple" procedures that don't actually solve the problem at hand (here, learn words).  We know that they work (sort of) because we've tried them and checked that they do.  But a child has no way of knowing that this simple procedure works while this other one doesn't, so the procedure would need to be innately associated with the word learning task.  This raises Darwin's problem for the ad hoc algorithm (as well as other related problems: if the learning procedure is really innately specified, then we ought to see dissociation disorders in acquisition, where the child's knowledge of language is fine, but their word learning algorithm is damaged somehow).

Second, ad hoc procedures like these only partially solve the problem, and there's usually no clear way to extend them to solve the problem fully, so some other learning mechanism will be required anyway.  For example, the unigram+greedy approach can find around 3/4 of tokens and 1/2 of the types, and there's no obvious way to extend it so it learns all the tokens and all the types.  But children do eventually learn all the tokens and all the types, and we'll need another procedure for doing this.  Note that the Bayesian approach that relies on more complex models does have an account here, even though it currently involves "wishful thinking": as the models become more accurate by including more linguistic phenomena and the search procedures become more accurate, the word segmentation accuracy continues to improve.  We don't know how to build models that include even a fraction of the linguistic knowledge of a 3 year old, but the hope is that eventually these models would achieve perfect word segmentation, and indeed, be capable of learning all of a language.  In other words, there isn't a plausible path by which the ad hoc approach would generalise to learning all of a language, while there is a plausible path for the Bayesian approach that relies on more and more accurate linguistic models.

Finally -- and I find it strange to be saying this to a linguist who is otherwise providing very cogent arguments for linguistic structure -- there really are linguistic structures and linguistic dependencies, and it seems weird to assume that children use a learning procedure that just plain ignores them.  Maybe there is a stage where children think language consists of isolated words (this is basically what a unigram model assumes), and the child only hypothesises larger linguistic structures after some "maturation" period.  But our work shows that you don't need to assume this; instead, a single model that does incorporate these dependencies combined with a more effective search procedure actually learns words from scratch more accurately than the ad hoc procedures.

Norbert sometimes seems very sure he knows what aspects of language have to be innate.  I'm much less sure myself of just what has to be innate and what can be learned, but I suspect a lot has to be innate (I think modern linguistic theory is as good a bet as any).  I think an exciting thing about Bayesian models is that they give us a tool for investigating the relationship between innate knowledge and learnability.  For example, if we can show that a model with innate knowledge X+X' can learn Z from data Y, but a model with only innate knowledge X fails to learn Z, then probably innate knowledge X' plays a role in learning Z.  I said probably because someone could claim that the child's data isn't just Y but also includes Y' and from model X and data Y+Y' it's possible to infer Z.   Or someone might show that a completely different set of innate knowledge X'' suffices to learn Z from Y.  So of course a Bayesian approach won't definitely answer all the questions about language acquisition, but it should provide another set of useful constraints on the process.
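The X-versus-X+X' comparison can be made concrete with a deliberately tiny example. The coin-flip setting and all the numbers below are my own illustrative choices, not anything from the discussion: two Bayesian learners estimate the same quantity from the same sparse data, one with a flat prior and one with an informative "innate" prior, and only the latter gets close to the truth.

```python
def posterior_mean(heads, tails, prior_a, prior_b):
    """Posterior mean of a coin's bias under a Beta(prior_a, prior_b) prior."""
    return (heads + prior_a) / (heads + tails + prior_a + prior_b)

true_p = 0.9
heads, tails = 2, 1                               # data Y: only three observations

flat = posterior_mean(heads, tails, 1, 1)         # knowledge X alone (flat prior)
informed = posterior_mean(heads, tails, 9, 1)     # X + X': prior centred near 0.9

# The learner with the informative prior lands closer to true_p.
print(flat, informed)
```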


  1. This comment is in two parts due to length. It's sort of fun commenting on someone else's post.

    Part 1:

    @ Mark, some comments:

    “…Bayesian approaches are often conflated with strong empiricist approaches, and I think this post does that too.”

    If so, this was entirely unintentional. All that I was interested in was finding some real live exemplars of “less is more” reasoning. From what I was able to tell, it has generally been taken for granted that the only issue wrt finding algorithms is dealing with resource issues. So there was a tacit assumption that Carnap’s Principle is a regulative ideal in that the more information you can take into account the better. And by “better” we mean does a better job of doing what needs doing, e.g. word learning, word segmentation etc. I took Gigerenzer to be questioning this. And this had nothing to do with Empiricism (though you are quite right to think I see E’s baleful menace lurking everywhere). As evidence of my intent I note that I mentioned that these considerations applied to my favorite view of things linguistic: the ideal speaker-hearer model. At any rate, here I plead innocent.

    “Given this, I think a very reasonable approach is to formulate a model that includes as much relevant information from universal grammar as we can put into it, and perform inference that is as close to optimal as we can achieve from data that is as close as possible to what the child receives.”

    “Why not start with the default assumption that the same is true of language?”

    I could not agree more as a research strategy. However, it seems worth keeping in mind that the “optimal” strategy might not be the “rational” one if by this we mean adhering as closely as possible to Carnap’s Principle. I take this to be a consequence of Charles’s discussion where he provided some parameters for when ignoring information might be optimal even if not rational via CP.

  2. Part 2:

    I completely agree with your very reasonable reservations about the relations between algorithmic complexity and brain circuitry. But I am not sure about which Chomskyans you are thinking of when you think them opposed to rational analyses of acquisition. Like I’ve said before, this program looks very like the one outlined in Aspects with a few statistical bells and whistles. I know that Jeff Lidz is a big fan and I’ve discussed his work on FoL (here). The push back has not been ideological (e.g. Berwick, Yang) but empirical (this approach does better than the Bayes one in this context). What’s wrong with that? It is, I would think, an empirical issue in the end, right? Bayesians sometimes talk as if their game is the only one in town because it is rational. Well, maybe. But it’s nice to find cases where we can “test” the view and it is important to see that this argument is not conceptual but empirical. For this we need examples at work. Are these analyses right? I wouldn’t be in a position to know. Are they interesting? Well, yes, at least conceptually.

    “But our work shows that you don't need to assume this; instead, a single model that does incorporate these dependencies combined with a more effective search procedure actually learns words from scratch more accurately than the ad hoc procedures.”

    Again, my intent was not to endorse the models but to provide exemplars of where the common CP assumption might prove false and how to see this in a real world case. I find this idea interesting for my own parochial reasons. Your bet, which might be right, is that once the right nativist assumptions about linguistic structure are plugged in, something like Bayes will prove to be close to correct, as in vision. Maybe. Like I said, this would vindicate the Aspects view of things (and would not be something that I would gainsay). But it is worth noting alternatives that are not silly even if they might be wrong. Again, the only way to find out is by inspecting particular cases.

    “I think an exciting thing about Bayesian models is that they give us a tool for investigating the relationship between innate knowledge and learnability.”

    I agree and I’ve highlighted pieces that try to do this in FoL. So, let’s be clear: I have nothing against Bayes. In fact, I think that it’s worth investigating empirically. I do have a problem with the assumption that it must be right and with some of the overselling. But the basic idea seems fine. And with all such fine ideas it’s nice to know what would be considered a non-version of the idea.

  3. Thanks Norbert for posting my comments -- as an article, no less! I've been called many things, but this is the first time I've been called "estimable". (Earlier this week a colleague said I had "gravitas"; I'm not sure if she was referring to my age or my weight!).

    Anyway, at the risk of boring our readers by having too great an agreement, I do of course agree with you that a Bayesian approach is very close to the Aspects view. As I hope you also agree, Bayesian theory also suggests general principles for setting parameters in a Principles and Parameters approach, and a way around "no negative evidence" issues in parameter setting, so I think it is quite compatible with the general generative point of view. (I'm not sure it has much to say about Darwin's problem and Minimalism, but I'm open to suggestions!)

    In order to enliven this post let me end with a few deliberately tendentious things.

    First, I think some people -- often psychologists -- speak as if experimental results such as reaction times are the most important facts about language. But of course they aren't. The central fact about language acquisition is that children actually acquire languages, just as the central fact about language processing is that humans actually produce and comprehend sentences. A theory of acquisition or processing is seriously deficient if there's no plausible way of extending it to account for these central facts, even if it happens to agree exquisitely with certain experimental results. (Let me temper this by saying that it may be reasonable to focus on a deficient theory if one thinks it is the most promising path forward, but you shouldn't forget that the theory is nevertheless deficient!).

    I also think that a weakness of the Bayesian approach is that it hypothesises an "ideal learner", but doesn't explain how such an ideal learner might actually function. I think that's actually a reasonable approach, given that we know next to nothing about how complex representations such as trees are actually processed in the brain. It parallels the classical approach in generative grammar to language processing, which is factored into competence and performance. A Bayesian "ideal learner" is a kind of competence theory, and eventually I hope we'll be able to understand the "learning performance limitations" of real human learners. I'd like to see an approach to acquisition that parallels the classical approach to performance in generative grammar, in which performance constraints interact with but don't supplant the ideal competence theory.



    1. Well, if the Latin is any guide, it cannot be your age, unless she has some relativistic ideas about clocks and speed in mind. At any rate, I'll stick with 'estimable.'

      Ok, we seem to be in agreement. So, just to roil the waters a little, let me demur from one point. I agree that the amazing thing is that people acquire a language at all. And I am firmly on board with the skepticism regarding how brains do anything, including processing trees. That said, I am a little less skeptical than you are concerning the utility of pushing Bayes accounts to be more respectful of the little things we seem to know. Take Charles' recent post. It does more than say that Bayes doesn't work. It tries to understand why it doesn't and where it might. In other words, it takes the proposal seriously and explores its scope and also its limits. In short, it treats it as an empirical thesis, which I assume we can both agree is for the good.

      Moreover, it doesn't supplant a competence approach, but explains how certain learning strategies that seem to be biologically grounded might work in this problem domain.

      Last point: if we are to wait until neuro types understand how brains really work we will be waiting a very long time. Indeed, the converse may be more fruitful: do what you can using current psycho techniques (of course in a responsible manner) and let the brain sciences catch up. This doesn't mean abandoning the ideal learner, but it does mean that we can start to explore the consequences of tweaking various assumptions.

    2. Seems like we are all in broad agreement and only disagree in details, so I thought I'd chime in a bit as well.

      It is not clear to me that UG folks only favor ad hoc learning models. Is triggering or constraint demotion ad hoc? Maybe, in the sense that they work in a very specific domain, but no, because they fall under the broad range of hypothesis testing and error driven learners well established in other domains. Same goes for the reinforcement learning models I and others have been using to set parameters and rank constraints: a learning mechanism found throughout the animal world. Likewise, John and Lila’s experiments on word learning show very specific patterns that are incompatible with global models. Pursuit is a specific implementation, but we use it because it’s compatible with the behavioral results—and because it is a well studied subclass of reinforcement learning (Sutton and Barto 1998, section 2.9).
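For readers unfamiliar with pursuit methods, here is a minimal sketch of the idea on a two-armed bandit, loosely after Sutton and Barto (1998, sec. 2.9). The reward probabilities and learning rates below are illustrative choices of mine, not values from any of the models mentioned above: the learner keeps action-value estimates and "pursues" whichever action currently looks best by nudging its choice probability toward 1.

```python
import random

def pursuit_bandit(p_reward=(0.2, 0.8), steps=2000, alpha=0.1, beta=0.01, seed=0):
    """Pursuit learning on a two-armed bandit: chase the currently greedy action."""
    rng = random.Random(seed)
    q = [0.0, 0.0]      # action-value estimates
    pi = [0.5, 0.5]     # action probabilities
    for _ in range(steps):
        a = 0 if rng.random() < pi[0] else 1
        r = 1.0 if rng.random() < p_reward[a] else 0.0
        q[a] += alpha * (r - q[a])                  # update the value estimate
        greedy = 0 if q[0] >= q[1] else 1
        for i in range(2):                          # pursue the greedy action
            target = 1.0 if i == greedy else 0.0
            pi[i] += beta * (target - pi[i])
    return pi

pi = pursuit_bandit()
print(pi)   # with these settings the probability mass should end up on the better arm
```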

      Re indirect negative evidence. Yes, that’s a main appeal of the Bayesian approach, but the question is: does it work? In a forthcoming article for Language, I looked at how children may learn that "*asleep cat" is ungrammatical. Examining child directed English data, it seems that these a-adjectives are indistinguishable from typical adjectives on the basis of frequency and paraphrase (aka Uniqueness), the two most prominent forms of indirect negative evidence that have been suggested in both categorical and statistical settings. This of course does not preclude other uses of indirect negative evidence, but there is work to do.

    3. From Mark Johnson via Norbert Mail:

      To my mind one of the most important properties of Bayesian approaches is what the statisticians call consistency. A consistent estimator is one that eventually converges on a hypothesis when given increasing amounts of data generated by that hypothesis (assuming the hypothesis generating the data is within the learner's hypothesis space). Consistency is an important property because it guarantees that a learner will eventually learn what it is supposed to. Inconsistent learning procedures are ad hoc: maybe they will work on certain data, but maybe not.

      Bayesian estimators aren't the only consistent estimators. In fact, the consistency of Bayesian estimators follows from the consistency of maximum likelihood estimators. And there are other kinds of consistent estimators, such as moment-matching procedures.
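Consistency in this sense is easy to demonstrate in the simplest possible case (the coin and its bias are my illustrative stand-ins, not anything from the comment above): the maximum-likelihood estimate of a Bernoulli parameter is just the sample mean, and its error shrinks toward zero as the sample grows.

```python
import random

def mle_error(n, p=0.7, seed=1):
    """|sample mean - true bias| for n flips of a coin with bias p."""
    rng = random.Random(seed)
    flips = [1 if rng.random() < p else 0 for _ in range(n)]
    return abs(sum(flips) / n - p)

for n in (10, 1000, 100000):
    print(n, mle_error(n))   # the error tends toward 0 as n grows
```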

    4. @ Mark:

      "Inconsistent learning procedures are ad hoc: maybe they will work on certain data, but maybe not."

      I am not sure that this is yet a problem. All that means is that we need a good bead on the data if we are to discuss "ad hoc" "inconsistent" estimators, right? So, if we have a pretty good idea of what the PLD is, for example (by careful empirical investigation) then what's empirically wrong with an ad hoc estimator if it comports with the PLD? And in fact in many cases we do have a pretty good idea of what the PLD looks like (in Charles' and Lila's and John's cases a very good idea).

      Let me ask this another way. Though I am interested in the question of how one could in principle acquire something, I have no aversion to the question of how one actually does. If we can't fix the relevant data domain, then go with what we can do (estimators that must converge). But if we do know the lay of the data, why ignore it and reach for a more general solution that fails in other respects (e.g. does not acquire particularly well)? Perhaps I am missing something, but consistency here may be the hobgoblin of a certain kind of empirical pessimism, one that need not be entirely well founded in particular cases. But like I say, I may have missed the point.

    5. @Mark:

      To add to Norbert’s comment, the consistency of Bayesian calculations comes with the added requirements of having all the relevant information, and effectively infinite time. But this is never the case in reality. I don’t know how useful consistency is (this honestly is an open question). In fact, to me the Bayesian vs. frequentist vs. … debates also revolve around this question too.

      How useful is it to say we can learn things if we had access to all the information and an inordinate amount of time? Surely, the more important question is: given finite time (3-6 years in the case of language acquisition), and given sloppy, incomplete, and sometimes “wrong” data, how effective is the learning procedure? To me, at least some ad hoc procedures start with this question.

    6. The amount of time enters into this both in terms of the amount of data you have and the amount of computation you can do, and I agree that both of these are really important; the latter in particular is generally neglected by Bayesians and is an important factor. One can say, as Mark does, that we just want a competence model, which is fine, but just as there are reasons for rejecting grammars whose decision problem is intractable, there are also reasons for rejecting learning models where the problems are intractable. And Bayesian models are, in general, computationally hard to learn for the classes of grammars currently used.

      But apart from those caveats, let me play the high theory/Chomsky/Galileo trump card here. Of course we will learn things by considering an idealized, mathematically tractable variant of the problem, because that is just how it works in science. You study the simplified problem neglecting air resistance etc., and then later try to account for the complexities of the true situation once you have an adequate analysis of the simple case.

      It baffles me to be making this point against you guys; I often make this point in defense of generative grammar when I am talking to NLP types. This is one of the theoretical points on which I am on team Chomsky.

      (I see that Mark makes a different but related point below)

    7. I don't think we know that the LAD is consistent in this sense, but we do know that it has lots of fixed points (languages that are learned pretty well from the kinds of samples that their users produce in practice), and finds them very fast: it is usually possible for children to talk to their grandparents and vice versa, and creole formation doesn't seem to involve a stage where the language goes through lots of chaotic changes, like a feedback system that can't stabilize.

      For the LAD to actually be consistent would presumably be the simplest explanation of this, although presumably not the only one.

  4. Another comment from Mark J via N(orbert)-mail:

    Of course all we need is a theory that accounts for the languages children actually learn from the kinds of data they actually receive. But so far nobody has anything close to this. There's at least a pathway along which Bayesian approaches can be extended to achieve this goal, although I expect there will be many challenges and surprises along the way. Of course it's interesting that Charles' model agrees with his experimental data, but I think it's important to remember that these experiments aren't the same as language acquisition, and I don't see a way of extending Charles' model to provide a general model of language acquisition. And I see explaining language acquisition, rather than any single psychological experiment, as our goal. (I don't want to put words in Charles' mouth, but I think he could plausibly argue that it's premature to worry about general theories of language acquisition; his experiments might be our equivalent of Galileo's inclined planes, and all will become clear when our Newton arrives).

    I think a version of Darwin's problem is lurking behind much of computational language acquisition and computational psycholinguistics. How does the child know which algorithm it should use to acquire all the different kinds of knowledge it acquires? (This is still very challenging even if one claims that only the lexicon needs to be acquired, since lexical entries can be quite abstract, e.g., empty functional categories that determine word order and extraction domains). What about the parsing algorithm and the production algorithm? It's possible that we have innate procedures for each of these (e.g., computer programs in the genome), but I think we should try to see if they might follow from more general principles. Approaches like the Bayesian one provide a partial answer here: if prior information and data are combined following certain principles, then the posterior is guaranteed to converge to the correct result. So if the child can find an algorithm that follows these principles with respect to a certain body of knowledge and type of data, the child has a language acquisition procedure, or parser, or whatever.

    Now Bayesian principles don't specify how these principles should be instantiated in an algorithm. There are recipes for algorithms that follow these Bayesian principles (the Particle Filter algorithm schema is one of these, and from 30,000 feet Charles' algorithm seems similar to Lisa Pearl's 1-particle particle filter algorithm), and if the child can follow one of these recipes they would have an algorithm guaranteed to eventually converge to the correct hypothesis. But there are at least a couple of challenges here. First, finding an algorithm that follows one of these recipes is often non-trivial for problems with complex structure: e.g., you can probably get a publication in a computational linguistics journal if you can devise a particle filter that learns morpho-phonology. Second, because we know essentially nothing about how hierarchical structures like trees are represented in neural circuitry, we have little idea of what kinds of computations are simple or natural for the brain to perform. So I suspect it's premature to try to identify the algorithms used in acquisition. Instead, I think we should try to identify the knowledge and information used in acquisition; e.g., obtain results along the line of "innate principle X and primary linguistic data Y suffices to infer Z". But of course I really don't know what approach (if any) will help us understand language acquisition.
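To make the particle-filter schema concrete, here is a minimal sketch for the simplest possible "acquisition" problem: estimating a coin's bias from a stream of flips. The problem, the particle count, and the uniform prior are my illustrative choices (real models, like Pearl's, are far richer): each particle is a candidate hypothesis, reweighted by the likelihood of each observation and then resampled.

```python
import random

def particle_filter(data, n_particles=500, seed=0):
    """Estimate a coin's bias by filtering particles (candidate biases) over the data."""
    rng = random.Random(seed)
    particles = [rng.random() for _ in range(n_particles)]   # prior: uniform on [0, 1]
    for x in data:
        # reweight every hypothesis by the likelihood of the new observation
        weights = [p if x == 1 else 1.0 - p for p in particles]
        # resample particles in proportion to their weights
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return sum(particles) / n_particles

flip_rng = random.Random(1)
flips = [1 if flip_rng.random() < 0.8 else 0 for _ in range(200)]
est = particle_filter(flips)
print(est)   # the estimate should land near the true bias of 0.8
```

Note that resampling at every step loses particle diversity over time (so-called particle impoverishment), which is one reason writing a good particle filter for a structurally rich problem is non-trivial.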