Monday, March 11, 2013

Learning Fast and Slow I: How Children Learn Words



My inaugural post is the first of a three part sequence on language acquisition, which has become something of a signature issue for this blog. The title, and indeed the theme, are taken from Kahneman's recent book (Thinking, Fast and Slow). The analogy is appropriate. Anyone who has looked at child language--especially when quantitatively and cross linguistically--can see that some parts are learned fast while others proceed at a much more gradual pace. Depending on the department one works in, the fast and the slow side tend to get exaggerated to the point of mutual exclusion; I will name names, I promise.

These posts will be based on some strands of my own work--sorry but I can only talk about things I know. In the later parts of this sequence, I will deal with aspects of child language that decidedly call for either fast or slow learning, but I will kick things off with a case that requires a bit of both: How Children Learn Words.

As Noam and Lila reminded us over the years, word meanings are very complicated and associationist learning is hopeless.  A recent anecdote. My daughter goes to a wonderful Montessori school and one of the things they do is called "Word of the Week", when a grownup word is illustrated with examples from pictures, stories, and group activities. So for a long time, our three year old thought "loyalty" meant hugging, "drizzle" meant watering seeds for them to grow, and "coincidence" meant two (specific) girls wearing identical polkadot leggings. Eventually she learned these words; I just have no  idea how. Perhaps Jerry Fodor is right after all but I digress. 

The task at hand is much simpler. In fact, "How Children Learn Words" is false advertising because research in this business is really "How children learn words that go with medium sized bright objects". But we are aiming lower still: "How children learn which words go with which medium sized bright objects." 

Not very exciting for those who worry about type shifting and generalized quantifiers. Yet a sizable cognitive science literature is devoted to this very topic, and identifying what goes with what is clearly a necessary component in any theory of word learning. The revival of associationist learning, I suppose, comes from the realization that statistical learning is "more powerful than previously thought". Surely not all words, or every instance of them, will be neatly aligned with their "meanings"--really, and very very loosely, “referents”. But if the target meaning is associated with a word sufficiently frequently, and more frequently than its competitors, then the learner may be able to detect and learn from such statistical correlations. Again, scientists now know we are really good at statistics. Research on cross-situational learning asks whether the learner can tabulate word-meaning co-occurrence statistics over multiple learning instances to figure out which words pick out which objects.

Apparently they can. Consider a typical cross-situational learning study adapted from Smith & Yu (2008), where the subject hears the word "ball" accompanied by two learning instances: a ball and a bat in the first, a ball and a dog in the second.

The cross situational learner may construct a mental table that lists the frequencies with which the objects (BALL, BAT, DOG, following Jerry’s CARBURETOR convention) are paired with "ball". It is clear that BALL will be the winner since it co-occurs with “ball” more consistently than the others. Several computational models, starting with Jeff Siskind's MIT thesis in the 1990s, have implemented variants of this idea: some are quite straightforward implementations of cross tabulation (Yu & Ballard 2007, Fazly et al. 2010) while others situate learning in a Bayesian framework that enumerates and evaluates all possible lexicons (Frank et al. 2009).
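For the concretely minded, here is a minimal sketch of the bare cross tabulation idea in Python. It is not any of the published models cited above. The function names `tabulate` and `best_referent` are made up for illustration, and the two scenes are assumed to contain a ball with a bat and a ball with a dog.

```python
from collections import Counter, defaultdict

def tabulate(instances):
    """Count how often each candidate referent co-occurs with each word."""
    table = defaultdict(Counter)
    for words, referents in instances:
        for word in words:
            # Every thing in the scene gets one count for every word heard.
            table[word].update(referents)
    return table

def best_referent(table, word):
    """Return the referent that co-occurred most often with the word."""
    return table[word].most_common(1)[0][0]

# Two hypothetical learning instances, both accompanied by "ball".
instances = [
    ({"ball"}, {"BALL", "BAT"}),
    ({"ball"}, {"BALL", "DOG"}),
]
table = tabulate(instances)
print(best_referent(table, "ball"))  # BALL
```

BALL is counted twice while BAT and DOG are counted once each, so the table picks BALL no matter what the learner might have guessed along the way.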

A few years ago, Jon Stevens, a graduate student at Penn, implemented the variational learning model (Yang 2002; a subspecies of Reinforcement Learning to be featured in later posts) and tested it on some manually annotated video data from CHILDES. Each utterance is treated as two sets, one for a bag of words (i.e., no linguistic relation whatsoever) and the other for a bunch of reasonably observable/identifiable "things" (i.e., medium sized objects). For each word, the learner establishes a probability distribution over its candidate meanings. When a word is heard, the learner chooses a candidate meaning according to its probability and checks if it's present in the set of things. If so, the associated probability goes up (reward); otherwise, it goes down (penalty). This model worked well enough but was quite a bit worse than the far more complex Bayesian model (Frank et al. 2009). The computational linguist in me knew that any model, however simple, would not get in the game as long as some other model, however complex, was producing higher F-scores.
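The sample-then-reward-or-penalize loop can be sketched as follows. This is only an illustration under stated assumptions, not Jon's implementation: the update rule is assumed to be a linear reward-penalty scheme, the learning rate `gamma` and the fixed three-way candidate set are invented for the example, and `variational_step` is a hypothetical name.

```python
import random

def variational_step(probs, scene, gamma=0.1, rng=random):
    """Sample one candidate meaning for a word, then reward or penalize it.

    probs maps each candidate meaning to its probability (summing to 1);
    scene is the set of things observed when the word is heard.
    """
    meanings = list(probs)
    chosen = rng.choices(meanings, weights=[probs[m] for m in meanings])[0]
    n = len(meanings)
    if chosen in scene:
        # Reward: move the chosen meaning toward 1 and shrink the rest.
        for m in meanings:
            if m == chosen:
                probs[m] += gamma * (1 - probs[m])
            else:
                probs[m] *= (1 - gamma)
    else:
        # Penalty: shrink the chosen meaning and share the mass among the rest.
        for m in meanings:
            if m == chosen:
                probs[m] *= (1 - gamma)
            else:
                probs[m] = gamma / (n - 1) + (1 - gamma) * probs[m]
    return chosen

# Assumed toy setup: three candidates and two alternating scenes for "ball".
rng = random.Random(0)
probs = {"BALL": 1 / 3, "BAT": 1 / 3, "DOG": 1 / 3}
for scene in [{"BALL", "BAT"}, {"BALL", "DOG"}] * 50:
    variational_step(probs, scene, rng=rng)
print(probs)
```

Both branches keep the probabilities summing to one, and a wrong meaning that happens to be in the scene gets rewarded too, which is one way to see why a learner of this kind hedges for a long time.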

The variational model for word learning turns out to be too slow and indecisive, hedging its bets not unlike a cross situational learner. In a series of clever and illuminating experiments, my colleagues Lila and John have uncovered some limits of cross situational learning. Here is the briefest of summaries since the topic has come up before. Not only are subjects incapable of tracking cross situational statistics, they appear to follow a strategy dubbed Propose-but-Verify. If a candidate meaning is confirmed, the learner keeps it; if it is disconfirmed, the learner replaces it with a new candidate meaning. Crucially, the learner appears to assess only one candidate meaning at a time: if a meaning is confirmed, the learner ignores everything else in the scene completely.
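A bare-bones sketch of the keep-or-replace strategy, for a single word, might look like this. The assumption that a new conjecture is drawn uniformly at random from the current scene is mine, made for illustration, and `propose_but_verify` is a hypothetical function name rather than code from the experiments.

```python
import random

def propose_but_verify(scenes, rng=random):
    """Track a single conjectured meaning for one word across scenes."""
    guess = None
    for scene in scenes:
        if guess in scene:
            # Confirmed: keep the guess and ignore everything else in the scene.
            continue
        # No guess yet, or the guess was disconfirmed: conjecture a new one.
        guess = rng.choice(sorted(scene))
    return guess

# A guess made in an unambiguous scene survives later ambiguous ones.
print(propose_but_verify([{"BALL"}, {"BALL", "DOG"}]))  # BALL
```

Note that the learner stores nothing but the current guess: no counts, no probabilities, and no memory of the candidates it has already rejected.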

Learnability aficionados will recognize that Propose-but-Verify is the triggering model (Gibson & Wexler 1994) in disguise: for "a candidate meaning", read "a candidate grammar". If a grammar (e.g., parameter setting) works well for an input sentence, it is kept; otherwise the learner tries a new setting and moves on. (This is hypothesis testing par excellence, and I never understood why folks would characterize the triggering algorithm as an instance of domain specific learning.) But triggering is too fast for its own good. A tried and true parameter setting can be overturned on the basis of one counter example, possibly a speech error, which brings about serious learnability problems (Berwick & Niyogi 1996).  

Lila and John’s experiments have led to a new model of word learning, perhaps the simplest model possible. It combines fast and single minded hypothesis testing with slow and conservative probabilistic learning from the variational model. We call it Pursuit. The learner always tests, or pursues, the most probable word meaning, hence only one hypothesis at any time; the variational model, by contrast, samples among the candidates probabilistically--and that's the key difference. Just as before, if a meaning (the most probable) is confirmed, its probability goes up; if it is disconfirmed, its probability goes down and the learner adds another candidate from the current scene (with some initial probability). The next time the word is heard, the learner still goes for the most probable candidate, which may well be a candidate that failed before but has not slipped down the pecking order. In other words, failed candidates get to hang around, depending on how dominant they were before failing, rather than being booted out altogether.
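Here is a stripped-down sketch of that loop for a single word. It is an illustration under several assumptions and not the model from the paper: unnormalized scores stand in for probabilities, the learning rate `gamma` and the starting score `init` are invented values, new candidates are drawn at random from the scene, and the name `pursuit` is hypothetical.

```python
import random

def pursuit(scenes, gamma=0.1, init=0.1, rng=random):
    """Pursue the single best-scoring meaning for one word across scenes."""
    scores = {}
    for scene in scenes:
        if not scores:
            # First exposure: conjecture one thing from the scene.
            scores[rng.choice(sorted(scene))] = init
            continue
        best = max(scores, key=scores.get)
        if best in scene:
            # Confirmed: reward the pursued meaning, ignore the rest of the scene.
            scores[best] += gamma * (1 - scores[best])
        else:
            # Disconfirmed: penalize it but keep it around, then add a new candidate.
            scores[best] *= (1 - gamma)
            fresh = sorted(set(scene) - set(scores))
            if fresh:
                scores[rng.choice(fresh)] = init
    return scores

# A candidate confirmed twice before failing still outranks the newcomer.
print(pursuit([{"BALL"}, {"BALL", "DOG"}, {"BALL", "BAT"}, {"CUP"}]))
```

In the usage line, BALL is conjectured, rewarded twice, and then penalized by the CUP scene, yet it remains the top candidate, which is the "hang around" behavior described above.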

The Pursuit model is decidedly non-optimal. Take a look at the BALL-BAT-DOG example. The cross situational models always learn BALL. The Pursuit model only gets it 75% of the time. If you guess BALL in the first instance, it will be pursued and succeeds in the second; all good. But if you guess BAT in the first instance, which will be disconfirmed in the second, you only have a 50% chance of guessing BALL. Yet surprisingly--no B.S. Norbert, we really were surprised--the Pursuit model came out Best in Show, in a matter of seconds rather than hours as the Bayesian model requires. Obligatory performance table below; for details, see The Pursuit of Word Meanings.
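The 75% figure can be checked by enumerating the learner's guesses exactly. The sketch below assumes the first scene holds BALL and BAT, the second holds BALL and DOG, and every guess is uniform over the things in view. The function name `p_learn` is made up for this example.

```python
from fractions import Fraction

def p_learn(target, scene1, scene2):
    """Exact probability of ending on `target` after two learning instances."""
    total = Fraction(0)
    for first in scene1:
        p_first = Fraction(1, len(scene1))
        if first in scene2:
            # The first guess is confirmed in the second instance and kept.
            if first == target:
                total += p_first
        else:
            # The first guess is disconfirmed; guess again from the second scene.
            total += p_first * Fraction(scene2.count(target), len(scene2))
    return total

print(p_learn("BALL", ["BALL", "BAT"], ["BALL", "DOG"]))  # 3/4
```

That is 1/2 for guessing BALL outright plus 1/2 × 1/2 for guessing BAT first and recovering, for a total of 3/4.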


Model               Precision  Recall  F-score
cross-situational   0.28       0.21    0.24
Propose-but-Verify  0.04       0.31    0.08
Bayesian            0.50       0.29    0.37
Pursuit             0.44       0.38    0.41


Here is why. While mommies may not “show them the thing whereof they would have them have the idea; and then repeat to them the name that stands for it, as ‘white’, ‘sweet’, ‘milk’, ‘sugar’, ‘cat’, ‘dog’”, as John Locke speculated, they never set up cross situational learning conditions where meanings are uniformly spread out across words and learning instances. If one looks at the word learning data children receive, one cannot fail to note that words are indeed embedded in multiply ambiguous settings. Yet every now and then, mommy does point to “cake” and say “cake”, and there is a ball around, a bright orange bouncy one at that, when the child hears “ball!”. The Pursuit model latches on to these low ambiguity learning instances, even though they are few and far between, and pursues them with abandon. These relatively salient cues are diluted under cross situational learning, which averages everything including many (more) highly noisy ones. The artificial world of cross situational learning does favor the cross situational model but bears little relevance to reality, which is indeed one of the main points in Lila and John’s papers.

"Shoe."

A very small step toward word learning, to be sure. But even a small step needs to be taken deliberately, neither too fast nor too slow.




6 comments:

  1. Very interesting post, Charles. Can I ask one thing -- you say "associationist learning is hopeless.", but it seems like your model is quite associationist -- indeed variational learning is as you point out quite old school. So what is the difference between the 'bad' associationism, and 'good' associationism? Is it the nature of the hypothesis space -- highly structured versus unstructured?

    1. Hi Alex: The hypothesis space makes all the difference. Even in a simple case like this, where all the learner does is figure out the associations, an enormous amount of hypothesis space structuring goes into it, e.g., stuff like the whole object constraint and mutual exclusivity (a probabilistic version; see the paper). Other constraints are essentially built into how we annotated the video data, including word segmentation.

  2. From Lila:

    Replying both to Charles and to Alex Clark: Yes, John Trueswell and I do pro tem endorse the model that we worked on with Stevens and Yang (see our joint CS proceedings paper, Stevens/Yang/Trueswell/Gleitman, in review). This is even though, as Alex Clark quickly notices, the model as it stands has a little associationist machine inside it. The interest of this view, in our opinion, is that the mini-preservation of past experience has some potential utility for understanding how the learner could acquire homonyms (notice that John’s and my model, Propose but Verify, has trouble in just this regard). But as long as we’re talking here, I want to point out that, in the end, I have the same reservations about this aspect of the Pursuit model as does Alex, namely that it is underlyingly an associationist scheme in disguise, however limited.

    As Charles points out in his reply to Alex, the necessity is to deal with the real and exploding ambiguity that every observation presents (see our pictures of “laboratory” and “real” input circumstances, reproduced in the two panels in Charles’ blog), and the Pursuit model does it via some species of limited/controlled cumulative learning, along with some background constraints on representation (e.g., the “whole object bias” and “mutual exclusivity” both a la Ellen Markman). But such constraints, while required, don’t by themselves come near to narrowing the hypothesis space sufficiently to enable word learning. Instead or in addition, there are ways of avoiding error (false conjectures) that do not involve preservation of past experience at all, and indeed our recent findings (stay tuned for the papers documenting these, which are now in preparation) demonstrate that by and large people use this alternative machinery; namely, they use a species of pre-filtering such that “low quality information” (see our original paper, the empirical data source for the Pursuit model) is simply ignored, and never even enters into the search for meaning. The finding is that, with considerable reliability, subjects can distinguish between exposures to a word in context that present good opportunities for word learning, and those that do not; and so they attempt their word learning only on the former subparts of the actual input data. One of the bases by which subjects really do select a meaning, given an observation, is their exquisite sensitivity to the timing properties of this “high quality input” vis a vis the utterance of a new word. So the short story here is that our findings have led us to a position in which the learner pre-selects usable input (and so avoids error) rather than optimizing over a set of inputs (and so recovering from error). And thank heaven, because the learners would otherwise fail because they have tiny and unreliable remembrance of things past.

  3. I wonder if there might not be some version of the basic idea here that worked for syntax. For example, utterances such as 'who do you want to read you a story' might provide good occasions for learning the basic constraint on to-contraction, whatever it is, because the context makes them intelligible and the subject matter interesting (and they might get said several times a week at least).

  4. Avery: I did a quick search in about half a million sentences of child directed English--about half a year's learning data--but did not find any instance of the informative question you have in mind.

  5. Yes, but was it collected under the conditions under which this one or its variants would appear? Perhaps more people wiring their houses like Deb Roy will answer such questions. I looked through the English language portions of the CHILDES database for such things, as has Andrea Zukowski, without finding very much, but the investigators aren't usually there when the kids are being put to bed, nor in many other kinds of routine situations.
