In some earlier posts (e.g. here, here), I discussed a theory of word acquisition developed by Medina, Snedeker, Trueswell and Gleitman (MSTG) that I took to question whether learning in the classical sense ever takes place. MSTG propose a theory they dub “Propose-but-Verify” (PbV), which postulates that (i) word learning in kids is essentially a one-trial process in which everything but the first encounter with a word is largely irrelevant, (ii) at any given time only one hypothesis is being entertained (i.e. there is no hypothesis testing/comparison going on) and (iii) updating only occurs if the first guess is disconfirmed, and then it occurs pretty rapidly. MSTG’s theory has two important features. First, it proceeds without much counting of any sort. Second, the hypothesis space is very restricted (viz. it includes exactly one hypothesis at any given time). These two properties leave relatively little for stats to do, as there is no serious comparison of alternatives going on (there’s only one candidate at a time and it gets abandoned when falsified).
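To make properties (ii) and (iii) concrete, here is a minimal Python sketch of a PbV-style learner. It is an illustration under stated assumptions, not MSTG's implementation: the function name `propose_but_verify`, the representation of an exposure as a (word, set of candidate meanings) pair, and the uniform random choice of a new guess are all hypothetical.

```python
import random

def propose_but_verify(exposures, rng=None):
    """Toy PbV-style learner (hypothetical sketch).

    exposures: iterable of (word, set_of_candidate_meanings) pairs.
    Returns a dict mapping each word to the single guess currently held.
    """
    rng = rng or random.Random(0)
    guess = {}  # word -> the one hypothesis entertained for that word
    for word, meanings in exposures:
        if word not in guess or guess[word] not in meanings:
            # First encounter, or the current guess is disconfirmed:
            # drop it and propose one new meaning at random (assumption).
            guess[word] = rng.choice(sorted(meanings))
        # Otherwise the guess is confirmed; nothing is counted or stored.
    return guess
```

Note what is absent: no table of alternatives and no scores. A single disconfirming scene is enough to replace the current guess.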
This story was always a little too good to be true. After all, it seems quite counterintuitive to believe that single instances of disconfirmation would lead word acquirers (WA) to abandon a hypothesis. And, unsurprisingly, things that seem too good to be true often aren’t. However, a later reconsideration of the same kind of data by a distinguished foursome (partially overlapping, partially different) argues that the earlier MSTG model is “almost” true, if not exactly spot on.
In a new paper (here) Stevens, Yang, Trueswell and Gleitman (SYTG) adopt (i)-(iii) but modify them to add a more incremental response to relevant data. The new model, like the older MSTG one, rejects “cross situational learning”, which SYTG take to involve “the tabulation of multiple, possibly all, word-meaning associations across learning instances” (p.3), but adds a more gradient probabilistic data evaluation procedure. The process has two parts.
First, for “familiar” words, this account, dubbed “Pursuit with abandon” (p. 3) (“Pursuit” (P) for short), selects the single most highly valued option (just one!) and rewards it incrementally if it is consistent with the input. If it is not, P decreases its score a bit while also randomly selecting a single new meaning from “the available meanings in that utterance” (p. 2) and rewarding it a bit. This take-a-little, give-a-little is the stats part. In contrast to PbV, P does not completely dump a disconfirmed meaning, but only lowers its overall score somewhat. Thus, “a disconfirmed meaning may still remain the most probable hypothesis and will be selected for verification the next time the word is presented in the learning data” (p. 3). SYTG note that replacing MSTG’s one-strike-you’re-out “counting” procedure with a more gradient probabilistic evaluation measure adds a good deal of “robustness” to the learning procedure.
Second, for novel words, P encodes “a probabilistic form of the Mutual Exclusivity Constraint…[viz.] when encountering novel words, children favor mapping to novel rather than familiar meanings” (p. 4). Here too the procedure is myopic, selecting one option among many and sticking with it until it fails enough to be replaced via step one above.
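The two parts just described can be sketched in Python as follows. This is a hedged illustration, not SYTG's code. The learning rate `GAMMA`, the particular reward and penalty formulas, the function name `pursuit_step`, and the deterministic "least familiar meaning" pick for novel words are assumptions made for the example; the description above only says that scores move "a bit" and that the mutual exclusivity preference is probabilistic.

```python
import random

GAMMA = 0.1  # assumed learning rate (hypothetical value)

def reward(score):
    # Nudge a score upward, toward 1 (assumed update rule).
    return score + GAMMA * (1.0 - score)

def penalize(score):
    # Nudge a score downward, toward 0 (assumed update rule).
    return score * (1.0 - GAMMA)

def pursuit_step(assoc, word, meanings, rng):
    """Process one exposure; assoc maps word -> {meaning: score}."""
    meanings = sorted(meanings)
    if word not in assoc:
        # Novel word: prefer the meaning least associated with known
        # words (a deterministic stand-in for mutual exclusivity).
        def familiarity(m):
            return max((s.get(m, 0.0) for s in assoc.values()), default=0.0)
        pick = min(meanings, key=familiarity)
        assoc[word] = {pick: GAMMA}
        return pick
    scores = assoc[word]
    best = max(scores, key=scores.get)  # the single favored hypothesis
    if best in meanings:
        scores[best] = reward(scores[best])  # confirmed: small reward
        return best
    scores[best] = penalize(scores[best])    # disconfirmed: small penalty
    pick = rng.choice(meanings)              # one random replacement
    scores[pick] = reward(scores.get(pick, 0.0))
    return pick

# Usage: "dog" is seen three times; the third scene lacks DOG.
rng = random.Random(0)
assoc = {}
for scene in [{"DOG"}, {"DOG", "CAT"}, {"CAT"}]:
    pursuit_step(assoc, "dog", scene, rng)
```

With these assumed numbers, the score for DOG rises to about 0.19 after two consistent scenes and falls only to about 0.171 after the disconfirming one, so it still outranks the newly added CAT at 0.1. That is the robustness point: one bad scene does not dislodge the favored hypothesis.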
Thus, the P model, from what I can tell, is effectively the old PbV model but with a probabilistic procedure for, initially, deciding on which is the “least probable” candidate (i.e. to guide an initial pick) and for (dis)confirming a given candidate (i.e. to up/downgrade a previously encountered entry). Like PbV, P is very myopic. Both reject cross-situational learning and concentrate on one candidate at a time, ignoring other options if all goes well and choosing at random if things go awry.
This is the P model. Using simulations based on CHILDES data, the paper goes on to show that this system is very good when compared both with PbV and, more interestingly, with more comprehensive theories that keep many hypotheses in play throughout the acquisition process. To my mind, the most interesting comparison is with Bayesian approaches. I encourage you to take a look at the discussion of the simulations (section 3 in the paper). The bottom line is that the P model bested the three others on overall score, including the Bayesian alternative. Moreover, SYTG were able to identify the main reason for the success: non-myopic comprehensive procedures fail to sufficiently value “high informative cues” provided early in the acquisition process. Why? Because comprehensive comparison among a wide range of alternatives serves to “dilute the probability space” for correct hits, thereby “making the correct meaning less likely to be added to the lexicon” (p. 6-7). It seems that in the acquisition settings found in CHILDES (and in MSTG’s more realistic visual settings), this dilution prevents WAs from more rapidly building up their lexicons. As SYTG put it:
The advantage of the Pursuit model over cross-situational models derives from its apparent sub-optimal design. The pursuit of the most favored hypothesis limits the range of competing meanings. But at the same time, it obviates the dilution of cues, especially the highly salient first scene…which is weakened by averaging with more ambiguous learning instances…[which] are precisely the types of highly salient instances that the learner takes advantage of…(p. 7).
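A toy calculation shows what dilution looks like. The sketch assumes a deliberately crude cross-situational learner that scores a meaning by its share of all co-occurrence counts for a word; the models SYTG actually compare against are more sophisticated, and the function name `cooccurrence_share` and the scenes are made up for illustration.

```python
from collections import Counter

def cooccurrence_share(scenes, target):
    """Share of all co-occurrence counts that go to `target`."""
    counts = Counter()
    for meanings in scenes:
        counts.update(meanings)  # every meaning in a scene gets +1
    return counts[target] / sum(counts.values())

# One unambiguous first scene for "dog", then two cluttered ones.
scenes = [
    {"DOG"},
    {"DOG", "BALL", "CUP", "SHOE"},
    {"DOG", "TREE", "CAR", "BIRD"},
]
for n in range(1, len(scenes) + 1):
    print(n, round(cooccurrence_share(scenes[:n], "DOG"), 2))
# prints: 1 1.0, then 2 0.4, then 3 0.33
```

Under this assumed scoring, the certainty earned from the clean first scene is averaged away by later clutter, even though DOG is present every time. A learner that commits to DOG on the first scene and merely verifies it afterwards never pays that cost.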
There is a second advantage of the P model as compared to a more sophisticated and comprehensive Bayesian approach. SYTG just touch on this, but I think it is worth mentioning. The Bayesian model is computationally very costly. In fact, SYTG note that full simulations proved impractical as “each simulation can take several hours to run” (p. 8). Scaling up is a well-known problem for Bayesian accounts (see here), which is probably why Bayesian proposals are often presented as Marrian level 1 theories rather than actual algorithmic procedures. At any rate, it seems that the computational cost stems from precisely the feature that makes Bayesian models so popular: their comprehensiveness. The usual procedure is to make the hypothesis space as wide as possible and then allow the “data” to find the optimal one. However, it is precisely this feature that makes the obvious algorithm built on this procedure intractable.
In effect, SYTG show the potential value of myopia, i.e. of very narrow hypothesis spaces. Part of the value lies in computational tractability. Why? The narrower the hypothesis space, the less work is required of Bayesian procedures to effectively navigate the space of alternatives to find the best candidate. In other words, if the alternatives are few in number, the bulk of explaining why we see what we get will lie not with fancy evaluation procedures, but with the small set of options that are being evaluated. How to count may be important, but it is less important the fewer things there are to count among. In the limit, sophisticated methods of counting may be unnecessary, if not downright unproductive.
The theme that comprehensiveness may not actually be “optimal” is one that SYTG emphasize at the end of their paper. Let me end this little advertisement by quoting them again:
Our model pursues the [i.e. unique NH] highly valued, and thus probabilistically defined, word meaning at the expense of other meaning candidates. By contrast, cross-situational models do not favor any one particular meaning, but rather tabulate statistics across learning instances to look for consistent co-occurrences. While the cross-situational approach seems optimally designed [my emph, NH], its advantage seems outweighed by its dilution effects that distract the learner away from clear unambiguous learning instances…It is notable that the apparently sub-optimal Pursuit model produces superior results over the more powerful models with richer statistical information about words and their associated meanings: word learning is hard, but trying too hard may not help.
I would put this slightly differently: it seems that what you choose to compare may be as important as (more important than?) how you choose to compare them. SYTG reinforce MSTG’s earlier warning about the perils of open-mindedness. Nothing like a well-designed narrow hypothesis space to aid acquisition. I leave the rationalist/empiricist overtones of this as an exercise for the reader.