My inaugural post is the first of a three part sequence on language acquisition, which has become something of a signature issue for this blog. The title, and indeed the theme, are taken from Kahneman's recent book (Thinking, Fast and Slow). The analogy is appropriate. Anyone who has looked at child language--especially quantitatively and cross linguistically--can see that some parts are learned fast while others proceed at a much more gradual pace. Depending on the department one works in, the fast and the slow side tend to get exaggerated to the point of mutual exclusion; I will name names, I promise.
These posts will be based on some strands of my own work--sorry but I can only talk about things I know. In the later parts of this sequence, I will deal with aspects of child language that decidedly call for either fast or slow learning, but I will kick things off with a case that requires a bit of both: How Children Learn Words.
As Noam and Lila reminded us over the years, word meanings are very complicated and associationist learning is hopeless. A recent anecdote. My daughter goes to a wonderful Montessori school and one of the things they do is called "Word of the Week", when a grownup word is illustrated with examples from pictures, stories, and group activities. So for a long time, our three year old thought "loyalty" meant hugging, "drizzle" meant watering seeds for them to grow, and "coincidence" meant two (specific) girls wearing identical polkadot leggings. Eventually she learned these words; I just have no idea how. Perhaps Jerry Fodor is right after all but I digress.
The task at hand is much simpler. In fact, "How Children Learn Words" is false advertising because research in this business is really "How children learn words that go with medium sized bright objects". But we are aiming lower still: "How children learn which words go with which medium sized bright objects."
Not very exciting for those who worry about type shifting and generalized quantifiers. Yet a sizable cognitive science literature is devoted to this very topic, and identifying what goes with what is clearly a necessary component in any theory of word learning. The revival of associationist learning, I suppose, comes from the realization that statistical learning is "more powerful than previously thought". Surely not all words, or every instance of them, will be neatly aligned with their "meanings"--really, and very very loosely, “referents”. But if the target meaning is associated with a word sufficiently frequently, and more frequently than its competitors, then the learner may be able to detect and learn from such statistical correlations. Again, scientists now know we are really good at statistics. Research on cross-situational learning asks whether the learner can tabulate word-meaning co-occurrence statistics over multiple learning instances to figure out which words pick out which objects.
Apparently they can. Consider a typical cross-situational learning study adapted from Smith & Yu (2008) where the subject hears the word "ball" accompanied by two learning instances:
The cross situational learner may construct a mental table that lists the frequencies with which the objects (BALL, BAT, DOG, following Jerry’s CARBURETOR convention) are paired with "ball". It is clear that BALL will be the winner since it co-occurs with “ball” more consistently than the others. Several computational models, starting with Jeff Siskind's MIT Thesis in the 1990s, have implemented variants of this idea: some are quite straightforward implementations of cross tabulation (Yu & Ballard 2007, Fazly et al. 2010) while others situate learning in a Bayesian framework that enumerates and evaluates all possible lexicons (Frank et al. 2009).
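The mental table can be sketched in a few lines of Python. This is only an illustration of cross tabulation, not any of the cited models; the function name and the two scenes ({BALL, BAT} and {BALL, DOG}) are my assumptions, inferred from the BALL-BAT-DOG discussion in this post.

```python
from collections import Counter

def cross_situational_counts(word, instances):
    """Tally how often each object co-occurs with `word` (hypothetical helper)."""
    counts = Counter()
    for heard_words, objects in instances:
        if word in heard_words:
            counts.update(objects)  # every visible object gets one count
    return counts

# Assumed learning instances: "ball" is heard in both scenes.
instances = [
    ({"ball"}, {"BALL", "BAT"}),
    ({"ball"}, {"BALL", "DOG"}),
]

counts = cross_situational_counts("ball", instances)
best = max(counts, key=counts.get)  # the most consistent co-occurring object
print(counts, best)
```

Under these assumed scenes BALL is tallied twice and BAT and DOG once each, so the tabulating learner settles on BALL.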
A few years ago, Jon Stevens, a graduate student at Penn, implemented the variational learning model (Yang 2002; a subspecies of Reinforcement Learning to be featured in later posts) and tested it on some manually annotated video data from CHILDES. Each utterance is treated as two sets, one for a bag of words (i.e., no linguistic relation whatsoever) and the other for a bunch of reasonably observable/identifiable "things" (i.e., medium sized objects). For each word, the learner establishes a probability distribution over its candidate meanings. When a word is heard, the learner chooses a candidate meaning according to its probability and checks if it's present in the set of things. If so, the associated probability goes up (reward); otherwise, it goes down (penalty). This model worked well enough but was quite a bit worse than the far more complex Bayesian model (Frank et al. 2009). The computational linguist in me knew that any model, however simple, would not get in the game as long as some other model, however complex, was producing higher F-scores.
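A minimal sketch of the sample-then-reward-or-penalize step for a single word follows. The post does not spell out the update rule, so the linear reward-penalty scheme, the learning rate `gamma`, the fixed candidate set, and the function name are all assumptions made for illustration, not Jon's actual implementation.

```python
import random

def variational_update(probs, scene, gamma=0.1, rng=random):
    """Sample one meaning by probability, then reward or penalize it.

    Assumes a linear reward-penalty update and at least two candidates.
    """
    meanings = list(probs)
    chosen = rng.choices(meanings, weights=[probs[m] for m in meanings])[0]
    n = len(meanings)
    if chosen in scene:
        # Reward: the chosen meaning moves toward 1, the rest shrink.
        for m in meanings:
            if m == chosen:
                probs[m] += gamma * (1 - probs[m])
            else:
                probs[m] *= 1 - gamma
    else:
        # Penalty: the chosen meaning shrinks, the rest share the freed mass.
        for m in meanings:
            if m == chosen:
                probs[m] *= 1 - gamma
            else:
                probs[m] = gamma / (n - 1) + (1 - gamma) * probs[m]
    return chosen

rng = random.Random(0)
probs = {"BALL": 1 / 3, "BAT": 1 / 3, "DOG": 1 / 3}
for scene in [{"BALL", "BAT"}, {"BALL", "DOG"}] * 50:
    variational_update(probs, scene, rng=rng)
print(probs)

# Deterministic check of the penalty branch: BALL is certain to be sampled.
check = {"BALL": 1.0, "BAT": 0.0}
variational_update(check, {"DOG"}, gamma=0.1)
```

Both branches keep the probabilities summing to one. Because the learner samples rather than commits, every candidate keeps getting tried, which is the bet hedging at issue in this model.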
The variational model for word learning turns out to be too slow and indecisive, hedging its bets not unlike a cross situational learner. In a series of clever and illuminating experiments, my colleagues Lila and John have uncovered some limits of cross situational learning. Here is the briefest summary since the topic has come up before. Not only are subjects incapable of tracking cross situational statistics, they appear to follow a strategy dubbed Propose-but-Verify. If a candidate meaning is confirmed, the learner keeps it; if it is disconfirmed, the learner replaces it with a new candidate meaning. Crucially, the learner appears to assess only one candidate meaning at a time: if a meaning is confirmed, the learner ignores everything else in the scene completely.
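Stripped to its bones, the strategy just described can be sketched as follows. This is a deliberately bare version that assumes perfect memory for the current conjecture and a uniform random guess when a new one is needed; the function name is hypothetical and nothing here is taken from Lila and John's papers beyond the keep-or-replace logic summarized above.

```python
import random

def propose_but_verify(scenes, rng=random):
    """Carry a single conjecture across scenes (hypothetical sketch)."""
    guess = None
    for scene in scenes:
        if guess is None or guess not in scene:
            # No conjecture yet, or it was disconfirmed: propose a new one
            # from the current scene, chosen uniformly at random.
            guess = rng.choice(sorted(scene))
        # If the conjecture is confirmed, the rest of the scene is ignored.
    return guess

final = propose_but_verify([{"BALL"}, {"BALL", "DOG"}])
print(final)
```

With an unambiguous first scene the learner proposes BALL, verifies it in the second scene, and never looks at DOG.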
Learnability aficionados will recognize that Propose-but-Verify is the triggering model (Gibson & Wexler 1994) in disguise: for "a candidate meaning", read "a candidate grammar". If a grammar (e.g., parameter setting) works well for an input sentence, it is kept; otherwise the learner tries a new setting and moves on. (This is hypothesis testing par excellence, and I never understood why folks would characterize the triggering algorithm as an instance of domain specific learning.) But triggering is too fast for its own good. A tried and true parameter setting can be overturned on the basis of one counter example, possibly a speech error, which brings about serious learnability problems (Berwick & Niyogi 1996).
Lila and John’s experiments have led to a new model of word learning, perhaps the simplest model possible. It combines fast and single minded hypothesis testing with slow and conservative probabilistic learning from the variational model. We call it Pursuit. The learner always tests, or pursues, the most probable word meaning, hence only one hypothesis at any time: the variational model, by contrast, samples among the candidates probabilistically--and that's the key difference. Just as before, if a meaning (the most probable) is confirmed, its probability goes up and if it is disconfirmed, its probability goes down and the learner adds another candidate from the current scene (with some initial probability). Next time when the word is heard, the learner still goes for the most probable candidate, which may well be a failed candidate from before but has not slipped down the pecking order. In other words, failed candidates get to hang around, depending on how dominant they were before failing, rather than being booted out altogether.
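Here is a rough sketch of one Pursuit step for a single word, following the description above. The particular update arithmetic, the learning rate `gamma`, using `gamma` as the initial score for a new candidate, and the function name are assumptions for illustration; see the paper for the model as actually specified.

```python
import random

def pursuit_step(scores, scene, gamma=0.1, rng=random):
    """Test only the current best meaning for a word (hypothetical sketch)."""
    if not scores:
        # First encounter with the word: pick one candidate from the scene.
        scores[rng.choice(sorted(scene))] = gamma
        return
    best = max(scores, key=scores.get)  # always pursue the front runner
    if best in scene:
        scores[best] += gamma * (1 - scores[best])  # confirmed: goes up
    else:
        scores[best] *= 1 - gamma                   # disconfirmed: goes down
        newcomer = rng.choice(sorted(scene))
        scores.setdefault(newcomer, gamma)          # add one new candidate

scores = {"BAT": 0.5}            # BAT was doing well before this scene
pursuit_step(scores, {"BALL"})   # BAT is absent, so it is penalized
leader = max(scores, key=scores.get)
print(scores, leader)
```

After one failure BAT drops to 0.45 and BALL enters at 0.1, so BAT is still the meaning pursued next time: a failed candidate hangs around in proportion to how dominant it was.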
The Pursuit model is decidedly non-optimal. Take a look at the BALL-BAT-DOG example. The cross situational models always learn BALL. The Pursuit model only gets it 75% of the time. If you guess BALL in the 1st instance, it will be pursued and succeeds in the 2nd; all good. But if you guess BAT in the 1st instance, which will be disconfirmed in the 2nd, you only have a 50% chance of guessing BALL. Yet surprisingly--no B.S. Norbert, we really were surprised--the Pursuit model came out Best in Show, in a matter of seconds rather than hours as the Bayesian model requires. Obligatory performance table below; for details, see The Pursuit of Word Meanings.
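The 75% figure for the BALL-BAT-DOG example can be checked by exact enumeration. The sketch below assumes the two scenes are {BALL, BAT} and {BALL, DOG} and that every new guess is drawn uniformly from the current scene; both are my reading of the example, and the function name is hypothetical.

```python
from fractions import Fraction

def chance_of_ending_on(target, scenes):
    """Exact probability that a single-hypothesis learner ends on `target`."""
    first = scenes[0]
    dist = {m: Fraction(1, len(first)) for m in first}  # uniform first guess
    for scene in scenes[1:]:
        updated = {}
        for guess, p in dist.items():
            if guess in scene:
                # Confirmed: the guess is kept.
                updated[guess] = updated.get(guess, 0) + p
            else:
                # Disconfirmed: re-guess uniformly from the current scene.
                for m in scene:
                    updated[m] = updated.get(m, 0) + p / len(scene)
        dist = updated
    return dist.get(target, Fraction(0))

scenes = [("BALL", "BAT"), ("BALL", "DOG")]
p_ball = chance_of_ending_on("BALL", scenes)
print(p_ball)
```

Half the time BALL is guessed first and confirmed; the other half BAT is guessed, fails, and BALL is then picked with probability one half, for 1/2 + 1/2 × 1/2 = 3/4.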
Here is why. While mommy may not “show them the thing whereof they would have them have the idea; and then repeat to them the name that stands for it, as “white”, “sweet”, “milk”, “sugar”, “cat”, “dog””, as John Locke speculated, she also never sets up cross situational learning conditions where meanings are uniformly spread out across words and learning instances. If one looks at the word learning data children receive, one cannot fail to note that words are indeed embedded in multiply ambiguous settings. Yet every now and then, mommy does point to “cake” and say “cake”, and there is a ball around, and a bright orange bouncy one at that, when the child hears “ball!”. The Pursuit model latches on to these low ambiguity learning instances, even though they are few and far between, and pursues them with abandon. These relatively salient cues are diluted under cross situational learning, which averages over everything, including many (more) highly noisy instances. The artificial world of cross situational learning does favor the cross situational model but bears little relevance to reality, which is indeed one of the main points in Lila and John’s papers.
A very small step toward word learning, to be sure. But even a small step needs to be taken deliberately, neither too fast nor too slow.