Now that Norbert has broken the Bayesian spell and brought up the word learning work we have done in the past, I’d like to share some thoughts that have been on my mind for quite some time, thoughts which may also go right to the heart of the issues raised in Norbert’s blog. Why, and when, does a resource-limited learner like Pursuit, which keeps track of few things, outperform an idealized learner such as a cross situational learning model, which keeps track of everything?
To refamiliarize you with this corner of the world, there are two views of word learning. A global learner keeps track of all the word-reference pairings across all the learning instances and outputs the best hypothesis. Quite clearly, a global learner observing the following 5 learning instances will conclude that “mipen” must mean ELEPHANT, for it gets a higher score than the other hypotheses.
Pursuit, by contrast, works differently and locally. For simplicity, let’s just keep a numerical score. If the learner guesses a meaning M and it’s confirmed, then M gets a boost of 1; otherwise it loses a point. The learner always tries out the meaning with the highest score first: if that meaning is not found in the scene, it takes a hit, but the learner does not go down the list to try the 2nd-best meaning; instead it grabs a new one (N) from the environment. If it has guessed N with “mipen” before, then N’s score gets an additional boost; otherwise N starts with a score of 1. It is easy to see that a previously highly ranked hypothesis will not be immediately booted out upon one instance of disconfirmation, adding robustness to learning. At the same time, if the learner has conjectured a wrong meaning and inadvertently built some confidence in it—say, mistaking CAT for DOG since both pets are consistently co-present in the living room—the right hypothesis may eventually catch up in the long run. For the same sequence of learning instances, the learner might have done the following:
ELEPHANT receives a boost three times in a row. Even when it takes a hit in instance 4 and the learner adds CAT to the hypothesis list with a score of 1, the next time “mipen” is heard, ELEPHANT, by virtue of being the highest (2>1), will be pursued and ends up back at 3. Note that at any instance, the number of hypotheses considered is strictly limited: one, if the best is confirmed, and two, if the best is disconfirmed and another one is added. The learner is oblivious to all other candidates.
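To make the procedure concrete, here is a minimal sketch of one Pursuit learning step in Python. The naming and bookkeeping are mine, not the authors’ implementation; `scores` maps candidate meanings to their current score for the word being learned, and `scene` is the set of meanings available in the current learning instance.

```python
import random

def pursuit_step(scores, scene):
    """One learning instance for a single word, per the description above."""
    if not scores:
        # First encounter: grab any meaning from the environment.
        guess = random.choice(sorted(scene))
        scores[guess] = 1
        return scores
    # Always try the current highest-scoring hypothesis first.
    best = max(scores, key=scores.get)
    if best in scene:
        scores[best] += 1                        # confirmed: boost of 1
    else:
        scores[best] -= 1                        # disconfirmed: takes a hit
        new = random.choice(sorted(scene))       # grab one from the environment
        scores[new] = scores.get(new, 0) + 1     # new hypothesis, or extra boost
    return scores
```

Running this on the five-instance sequence above reproduces the bookkeeping in the text: ELEPHANT climbs to 3, drops to 2 when CAT intrudes, and, still being the highest, is pursued again on the next occurrence of “mipen”.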
There are reasons to believe, at least tentatively, that in actual word learning situations, a local learner will often outperform the global learner.
You need to bear with me a bit here. Pursuit turns out to be impossible to analyze formally in the general case (at least for me). So let’s consider a Pursuit-like local model. For each instance, the learner picks a meaning from the environment and adds 1 to its score. That is, unlike Pursuit, this counting model treats the learning instances as independent. The model outputs the meaning with the highest score at the end of learning. That’s it.
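The counting model is simple enough to fit in a few lines. A sketch, again under my own naming; each scene is the set of candidate meanings present in one learning instance:

```python
import random

def counting_learner(scenes, seed=0):
    """Pick one meaning per scene uniformly at random, add 1 to its count,
    and output the highest-count meaning at the end. Instances are treated
    as independent, unlike in Pursuit."""
    rng = random.Random(seed)
    counts = {}
    for scene in scenes:
        pick = rng.choice(sorted(scene))       # uniform pick from the scene
        counts[pick] = counts.get(pick, 0) + 1
    return max(counts, key=counts.get)         # best hypothesis at the end
```

Note that the model has no memory across instances other than the counts themselves: no hypothesis is ever “pursued” or penalized.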
Let’s consider a hypothetical learning problem: how to learn that “dog” means DOG rather than CAT? As far as the model suburban middle class family is concerned, DOG and CAT are both present in the house and the poor kid—as if middle class and suburban aren’t hard enough—really has no way of knowing which is which. However, the DOG occasionally goes for a walk while the CAT stays at home. So let there be N instances (e.g., home) where DOG and CAT are co-present and the learner has a uniform probability PN of selecting a meaning (DOG, CAT, COUCH, etc.), and let there be U disambiguating instances (e.g., park, beach) where the DOG appears without the CAT and the learner has a uniform probability PU of selecting a meaning (DOG, SWING-SET, CRAB, etc.). Viewed this way, the learning of “dog” becomes a race between DOG and CAT: we need to quantify the effect of the U disambiguating instances that push DOG above CAT, for without them, DOG and CAT will never be teased apart.
Consider a global cross situational model. There are (N+U) instances, but only U of them favor DOG over CAT: the statistical advantage of DOG over CAT is therefore:
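The equation here was evidently an image that did not survive the formatting. Given the setup (DOG co-occurs with “dog” in all N+U instances, CAT only in the N ambiguous ones), a reconstruction would be:

\[
\Delta_x = \frac{(N+U) - N}{N+U} = \frac{U}{N+U}
\]

That is, the disambiguating evidence is diluted by the total number of learning instances.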
There is a variant of the global model. As shown by Yu & Smith (2009, Psych. Sci.), subjects are better at picking out the target when it is presented with fewer distractors rather than more (though never at ceiling; see footnote 1). So one can construct a model—thanks, reviewer #3!—that gives more weight to the less ambiguous learning instances and less weight to the more ambiguous ones. (The ambiguities are captured by PN and PU in our formulation: bigger probability = less ambiguity.) The statistical advantage of DOG over CAT is:
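This equation, too, appears to have been lost. A plausible reconstruction, weighting each instance by its selection probability (bigger probability = less ambiguity = more weight), would be:

\[
\Delta_{wx} = \frac{U \cdot P_U}{N \cdot P_N + U \cdot P_U}
\]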
For the Pursuit-like counting model, this is literally a counting race. In the N instances, the learner may sometimes choose DOG, sometimes CAT, and sometimes other things. There are three possibilities:
- If DOG is ahead of CAT during the N instances, then the U disambiguating instances offer no additional advantage for DOG since it’s already won.
- If CAT is at least U steps ahead of DOG during the N instances, then DOG will not be able to catch up during U even if the learner were to choose DOG every single time. The best it can manage is a tie: DOG is dead.
- If CAT is i (0≤i<U) steps ahead of DOG during the N instances, then DOG could in principle catch up, if it happens to be selected at least (i+1) times during U.
And all we need to do is to calculate the probability of the third case: that is the weight of evidence afforded by U in favor of DOG. The resulting expression is a monstrosity, but it is no more than a couple of multinomials (the independence assumption is what licenses the multiplication).
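The formula itself seems to have been an image that did not survive here, but the quantity as described can be computed directly: a trinomial over the N co-present instances (how often DOG, CAT, or something else gets picked) multiplied by a binomial tail over the U disambiguating instances. A sketch, with my own function name:

```python
from math import comb

def catchup_prob(N, U, pn, pu):
    """Probability of the third case ending in DOG's favor: after the N
    co-present instances, CAT leads DOG by i (0 <= i < U), and DOG is
    then picked at least i+1 times during the U disambiguating instances.
    pn, pu: per-instance probability of picking any one meaning."""
    total = 0.0
    for d in range(N + 1):             # d: DOG picks among the N instances
        for c in range(N - d + 1):     # c: CAT picks among the N instances
            lead = c - d               # CAT's lead over DOG after N
            if not (0 <= lead < U):
                continue
            # Trinomial: d DOG picks, c CAT picks, the rest something else.
            p_lead = (comb(N, d) * comb(N - d, c)
                      * pn ** d * pn ** c * (1 - 2 * pn) ** (N - d - c))
            # Binomial tail: at least lead+1 DOG picks among the U instances.
            p_catch = sum(comb(U, k) * pu ** k * (1 - pu) ** (U - k)
                          for k in range(lead + 1, U + 1))
            total += p_lead * p_catch
    return total
```

For instance, with N = 0 and U = 1, CAT has no lead, so DOG catches up exactly when it is picked in the single disambiguating instance, i.e., with probability PU.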
Formally analyzing these is impossible (again, at least for me), but a few plots are fairly instructive. Set N = 100 and PN = 0.25, and let’s look at a range of values for PU: 0.25, 0.75, and 0.1; that is, the scenes that do not contain CAT are, respectively, equally, less, and more ambiguous than the scenes that contain both DOG and CAT.
I played around with these a bit. We see that the local model does not always make better use of the disambiguating data, as in the case of PU = 0.1 (the last plot), but it often does. In general, if U is small relative to N and PU is not much smaller than PN, the local model gets a bigger jump from U than the global model. This confirms our earlier finding that a cross situational learner dilutes the few but highly informative learning instances: Δp (the Pursuit-like local learner) is much bigger than Δx (the cross situational learner) and Δwx (the weighted cross situational learner), especially when PU > PN. The intuition is quite simple. Over the N instances, especially if N is modestly large, CAT and DOG will be more or less tied. During the U instances, the probability that the learner fails to pick DOG enough times to build an advantage over CAT decreases nearly exponentially in U.
Lila, John and their colleagues have done considerable work to analyze the actual degree of referential ambiguity of word learning (Cartmill et al. 2013, PNAS), using the tried and true Human Simulation Paradigm (Gillette et al. 1999). The general conclusion is congruent with the formal analysis here as well as our earlier simulations. The referential ambiguity of words is generally very high. In most cases, people (i.e., college undergraduates who are socially and linguistically very capable) are pretty bad at guessing word meanings but occasionally everyone gets it right. These moments of epiphany are rare but incredibly useful: so Pursue them!
-----* Congratulations to Elissa Newport, winner of the 2015 Benjamin Franklin medal for Computer and Cognitive Science: hence the title as a fitting tribute to one of her famous suggestions. I also thank my co-conspirators Jon, John and Lila for discussion. They agree with me in spirit though they should not be held responsible for any faux pas that I may have committed here.
1. Note that strictly speaking, a “true” cross situational learner should choose DOG over CAT 100% of the time, because DOG is present more often than CAT (if U > 0). But in practice, this is not how things work. First, if this were how cross situational learning worked, then human subjects in all such learning conditions would be at a 100% ceiling: no one does that well, and all the evidence for cross situational learning rests on the result that the learner is more likely to select the “target” than the distractors, but is nowhere near perfect. (Indeed, a cross situational learning model outperforms human subjects in a typical cross situational learning experiment: here, doing better is not a good thing.) Second, if we place ourselves in a computational setting, a model can only learn a meaning if that meaning is considerably better than its competitors. This turns into the dark art of parameter tuning: find a threshold for the weight of evidence so that anything above the line is considered learned and anything below is left for further study. Believe me: just taking the winner regardless of its advantage over the competitors will give you a lot of false positives on anything remotely resembling actual language data. Thus, the cross situational learner is in practice probabilistic, and the term Δx quantifies the statistical weight of evidence for DOG over CAT.