In some earlier posts (e.g. here,
here),
I discussed a theory of word acquisition developed by Medina, Snedeker,
Trueswell and Gleitman (MSTG) that I took to question whether learning in the
classical sense *ever* takes place. MSTG propose a theory they dub “Propose-but-Verify” (PbV) that postulates that (i) word learning in kids is essentially a one-trial process in which everything but the first encounter with a word is essentially irrelevant, (ii) at any given time only one hypothesis is being entertained (i.e. there is no hypothesis testing/comparison going on) and (iii) updating only occurs if the first guess is disconfirmed, and then it occurs pretty rapidly. MSTG’s theory has two important features. First, it proceeds without much counting of any sort, and second, the hypothesis space is *very* restricted (viz. it includes exactly one hypothesis at any given time). These two properties leave relatively little for stats to do as there is no serious *comparison* of alternatives going on (as there’s only one candidate at a time and it gets abandoned when falsified).

This story was always a little too good to be true. After
all, it seems quite counterintuitive to believe that single instances of disconfirmation
would lead word acquirers (WA) to abandon a hypothesis. And, not surprisingly, things that seem
too good to be true often are not. However, a later reconsideration
of the same kind of data by a distinguished foursome (partially overlapping,
partially different) argues that the earlier MSTG model is “almost” true, if not exactly spot on.
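
To make the PbV idea concrete, here is a minimal sketch of a learner that stores exactly one guess per word and replaces it only when the current scene disconfirms it. The function name, the dictionary representation, and the choice to draw the replacement guess uniformly at random from the scene are my assumptions for illustration, not details taken from MSTG.

```python
import random


def propose_but_verify(memory, word, scene, rng=random):
    """One PbV-style trial (illustrative sketch, not MSTG's implementation).

    memory: dict mapping each word to its single current guess.
    scene:  non-empty list of meanings available when the word is heard.
    """
    guess = memory.get(word)
    if guess is None or guess not in scene:
        # No guess yet, or the guess is disconfirmed: drop it and
        # propose one new meaning from the current scene (assumed uniform).
        guess = rng.choice(scene)
        memory[word] = guess
    # A confirmed guess is simply kept; nothing is counted or tabulated.
    return guess
```

Note that nothing in `memory` records how often a guess has been confirmed, which is the sense in which the procedure leaves little for stats to do.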

In a new paper (here)
Stevens, Yang, Trueswell and Gleitman (SYTG) adopt (i)-(iii) but modify it to
add a more incremental response to relevant data. The new model, like that
older MSTG one, rejects “cross situational learning” which SYTG take to involve
“the tabulation of multiple, possibly all, word-meaning associations across
learning instances” (p. 3) but adds a more gradient *probabilistic* data evaluation procedure. The process works in two parts.
First, for “familiar” words, this account, dubbed “Pursuit
with abandon” (p. 3) (“Pursuit” (P) for short), selects the single most highly
valued option (just one!) and rewards it incrementally if consistent with the
input and if not it decreases its score a bit while also randomly selecting a
single new meaning from “the available meanings in that utterance” (p. 2) and
rewarding it a bit. This take-a-little, give-a-little is the stats part. In
contrast to PbV, P does not completely dump a disconfirmed meaning, but only
lowers its overall score somewhat. Thus,
“a disconfirmed meaning may still remain the most probable hypothesis and will
be selected for verification the next time the word is presented in the
learning data” (p. 3). SYTG note that replacing MSTG’s one-strike-and-you’re-out
“counting” procedure with a more gradient probabilistic evaluation measure
adds a good deal of “robustness” to the learning procedure.

Second, for novel words, P encodes “a probabilistic form of
the Mutual Exclusivity Constraint…[viz.] when encountering novel words,
children favor mapping to novel rather than familiar meanings” (p. 4). Here too the procedure is myopic, selecting
one option among many and sticking with it until it fails enough to be replaced
via step one above.
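
The two parts just described can be sketched in a few lines. This is only an illustration under stated assumptions: the learning rate `GAMMA`, the particular reward and penalty formulas, and the way the novel-word preference is computed are my guesses at one reasonable reading of the description above, and the names are hypothetical rather than SYTG's.

```python
import random

GAMMA = 0.1  # assumed learning rate; the value is illustrative only


def pursuit_step(assoc, word, scene, gamma=GAMMA, rng=random):
    """One Pursuit-style update (sketch under the assumptions above).

    assoc: dict mapping word -> {meaning: score}.
    scene: non-empty list of meanings available in this utterance.
    Returns the single meaning that was selected on this trial.
    """
    scores = assoc.setdefault(word, {})
    if not scores:
        # Novel word: prefer the meaning least strongly tied to any
        # other word (one way to encode mutual exclusivity).
        def familiarity(meaning):
            return max((s.get(meaning, 0.0)
                        for w, s in assoc.items() if w != word),
                       default=0.0)
        guess = min(scene, key=familiarity)
        scores[guess] = gamma
        return guess
    # Familiar word: test only the single highest-scoring meaning.
    best = max(scores, key=scores.get)
    if best in scene:
        scores[best] += gamma * (1.0 - scores[best])  # reward a little
    else:
        scores[best] *= (1.0 - gamma)                 # penalize a little
        alt = rng.choice(scene)                       # one random new pick
        old = scores.get(alt, 0.0)
        scores[alt] = old + gamma * (1.0 - old)       # and reward it a bit
    return best
```

With these assumed numbers, a meaning scored 0.5 that is disconfirmed once drops to 0.45 while the randomly chosen alternative rises to 0.1, so the disconfirmed meaning is still the one tested next time.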

Thus, the P model, from what I can tell, is effectively the
old PbV model but with a probabilistic procedure for, initially, deciding on
which is the “least probable” candidate (i.e. to guide an initial pick) and for
(dis)confirming a given candidate (i.e. to up/downgrade a previously
encountered entry). Like the PbV, P is
very myopic. Both reject cross situational learning and concentrate on one
candidate at a time, ignoring other options if all goes well and choosing at
random if things go awry.

This is the P model. Using simulations based on CHILDES
data, the paper goes on to show that this system is very good when compared
both with PbV and, more interestingly, with more comprehensive theories that
keep many hypotheses in play throughout the acquisition process. To my mind,
the most interesting comparison is with Bayesian approaches. I encourage you to
take a look at the discussion of the simulations (section 3 in the paper). The bottom line is that the P model bested
the three others on overall score, including the Bayesian alternative. Moreover, SYTG were able to identify the main
reason for the success: non-myopic comprehensive procedures fail to
sufficiently value “high informative cues” provided early in the acquisition
process. Why? Because comprehensive
comparison among a wide range of alternatives serves to “dilute the probability
space” for correct hits, thereby “making the correct meaning less likely to be
added to the lexicon” (p. 6-7). It seems
that in the acquisition settings found in CHILDES (and in MSTG’s more realistic visual
settings), this dilution prevents WAs from more rapidly building up their
lexicons. As SYTG put it:

The advantage of the Pursuit model
over cross-situational models derives from its apparent sub-optimal design. The
pursuit of the most favored hypothesis limits the range of competing meanings.
But at the same time, it obviates the dilution of cues, especially the highly
salient first scene…which is weakened by averaging with more ambiguous learning
instances…[which] are precisely the types of highly salient instances that the
learner takes advantage of…(p. 7).

There is a second advantage of the P model as compared to a
more sophisticated and comprehensive Bayesian approach. SYTG just touch on this, but I think it is worth
mentioning. The Bayesian model is computationally very costly. In fact, SYTG
note that full simulations proved impractical as “each simulation can take
several hours to run” (p. 8). Scaling up
is a well-known problem for Bayesian accounts (see here),
which is probably why Bayesian proposals are often presented as Marrian level 1
theories rather than actual algorithmic procedures. At any rate, it seems that
the computational cost stems from precisely the feature that makes Bayesian
models so popular: their comprehensiveness. The usual procedure is to make the
hypothesis space as wide as possible and then allow the “data” to find the
optimal one. However, it is precisely this feature that makes the obvious
algorithm built on this procedure intractable.

In effect, SYTG show the potential value of myopia, i.e. of
very narrow hypothesis spaces. Part of the value lies in computational
tractability. Why? The narrower the hypothesis space, the less work is required
of Bayesian procedures to effectively navigate the space of alternatives to
find the best candidate. In other words,
if the alternatives are few in number, the bulk of explaining why we see what
we get will lie not with fancy evaluation procedures, but with the small set of
options that are being evaluated. How to count may be important, but it is less
important the fewer things there are to count among. In the limit,
sophisticated methods of counting may be unnecessary, if not downright
unproductive.

The theme that comprehensiveness may not actually be
“optimal” is one that SYTG emphasize at the end of their paper. Let me end this
little advertisement by quoting them again:

Our model pursues the [i.e. unique
NH] highly valued, and thus probabilistically defined, word meaning at the
expense of other meaning candidates. By
contrast, cross-situational models do not favor any one particular meaning, but
rather tabulate statistics across learning instances to look for consistent
co-occurrences.

*While the cross-situational approach seems optimally designed* [my emph, NH], its advantage seems outweighed by its dilution effects that distract the learner away from clear unambiguous learning instances…It is notable that the apparently sub-optimal Pursuit model produces superior results over the more powerful models with richer statistical information about words and their associated meanings: word learning is hard, but trying too hard may not help.

I would put this slightly differently: it seems that what
you choose to compare may be as important as (more important than?) how you choose to
compare them. SYTG reinforce MSTG’s earlier warning about the perils of
open-mindedness. Nothing like a well-designed narrow hypothesis space to aid
acquisition. I leave the rationalist/empiricist overtones of this as an
exercise for the reader.

I'd like to encourage you to look again at Bayesian approaches and probabilistic models more generally, such as Mike Frank's and the ones we've developed in work on "Adaptor Grammars". Yes, some of these models have complicated mathematical derivations, which is often used as an excuse to ignore them as "psychologically unrealistic". But I claim this is a level confusion!

A probabilistic model is what Marr called a "computational model"; it specifies the different kinds of information involved and how they interact. For models of potentially unbounded things such as the lexicon the maths here is still quite new and unfamiliar to many. I've heard the apparent complexity of this maths used as an argument against Bayesian models, but this is as misguided as complaining that there's no way the planets could be doing anything as complicated as solving differential equations when they move around the sun.

One interesting property of probabilistic models is that there are a variety of relatively generic algorithms for inference ("learning"). We know there are trade-offs here: the computationally-intensive models you complain about are so intensive precisely because they are searching for optimal solutions.

But there are a variety of other Bayesian algorithms that trade optimality for other properties, such as "on-line" learning (i.e., exactly one pass through the data, see e.g. http://aclweb.org/anthology-new/U/U11/U11-1004.pdf).

Interestingly, some of these algorithms have the kind of properties you laud in your article (e.g., they maintain only a single analysis of each data item, rather than tracking all possible combinations), and in fact many of the algorithms you speak of approvingly can be viewed as approximations to some of these Bayesian algorithms.

A more theoretical approach such as a Bayesian one provides two big advantages over more ad hoc approaches.

First, it helps us understand why these algorithms have the structure they do (e.g., why do we smooth by adding rather than multiplying?). Understanding the approximations involved in the algorithm can help us understand what has happened when our models go wrong (and all our current models are wrong).

Second, and more importantly, it's usually easier to see how to extend and combine probabilistic models than algorithms. For example, associating words with topics as is done here is just one of many challenges in word learning; merely identifying the words in an utterance is a challenge. While there are algorithms for word segmentation, it's not at all obvious how to combine such an algorithm with the algorithm described in this post. However, at the level of probabilistic models this is not so challenging: we've shown how such models can be integrated, and perhaps more interestingly, we find synergies in their combination (i.e., word segmentation improves, as does learning word-topic mapping!) http://books.nips.cc/papers/files/nips23/NIPS2010_0530.pdf .

I could go on explaining how the algorithm for this model only keeps a small number of analyses for each possible word, but this post is long enough already ...

Thx. As you might guess, this level of detail is beyond me. But I encourage the cognoscenti to jump in here. It would be nice to get a heads up from the pros on what to look at and why.

Thought I would chime in, as this post touches on my work, as well as topics of my general interest.

I am in complete agreement with Norbert and Mark, that probabilistic methods have proven utility for language acquisition. I took a fair amount of stick for it when I started out with probabilistic parameter setting back in the 90s, and still do, but at least I have company. And I have no problem with Bayesian methods either, having cut my Bayesian teeth with folks like Mike Jordan. The reinforcement learning model I have been using for language acquisition and change can be straightforwardly recast in a Bayesian framework, which is very inclusive.

The only concern I have, and the goal of the Pursuit model (and John and Lila's experimental work), is to get the empirical matters of language learning right. Yes, Bayesian models are very good at incorporating multiple sources of information: the word segmentation work that Mark mentioned is a good example of that. But from the developmental point of view, it's not clear that this is what infants do. The work of Johnson & Jusczyk (2001, JML), and Shukla et al. (2011, PNAS), for instance, shows that when both prosodic and statistical information are available, infants use the former and ignore the latter. These results are quite representative of the acquisition literature. And the results from John, Lila, and colleagues' experiments also show the absence of information tracking over multiple scenes. Of course a Bayesian model can be construed to capture such findings, but why not call an orange an orange, and build a computational model, such as Pursuit, to transparently reflect this? For word segmentation, see the work of Constantine Lignos (http://www.aclweb.org/anthology-new/W/W11/W11-0304.pdf).

I have heard of, and used, the complexity argument before: Method X is too computationally intensive to be psychologically plausible. But I haven't used this recently, because it is too easily defeated: "we don't know how the brain works", and that's a fair point.

For practical purposes--we want to get the results back before dinner!--approximation methods are probably the way to go, even though there are intractability results on approximate methods as well. But sometimes the online approximation method alone (such as Constantine's work) may be sufficient for the job, sans the more complex model it tries to approximate. It'd be like building an oven to light a candle, but the oven needs to be lit by a match.

(post 1/2)

That's right. Let's call an orange an orange. But we are doing just that when we construct a Bayesian model and then specify different ways to investigate the posterior distribution of one or more of its parameters. We are just doing it as a part of our botanical investigation of the orange tree. As Mark points out, parameter spaces are often too large to carry out such an investigation exhaustively, so we resort to approximating algorithms. The advantage of having the computational level context specified for these algorithms---and this is Mark's point (or at least related)---is that we can then ask how different forms of approximation affect the empirical coverage. That is, we can ask why the "sub-optimal way of making use of information across situations...actually performs better" (Stevens, Yang, Trueswell & Gleitman 2013: 3) in certain contexts. I think the relevant oven analogy might be that we'd like to understand how an oven does its job for those times when we just have a fire and a spit but would like to get as close as possible to the oven's performance. It's true that we need to get the cooking done well. But the knowledge we gain about the relationship between oven-cooking and fire-and-spit-cooking gives us a way of understanding cooking as a whole.

In a similar vein to the Borschinger and Johnson paper that Mark links to, Sanborn, Griffiths & Navarro (2010; http://cocosci.berkeley.edu/tom/papers/rationalapproximations.pdf) have an excellent paper comparing different online approximation algorithms for category learning. One of the algorithms they discuss is Anderson's (1990, 1991), which they call Local MAP, since it samples from a Dirac distribution at the mode of the posterior over categories given the previous datapoints and category samples. This isn't the same as Pursuit, since Pursuit assumes a distribution over a finite number of categories and Local MAP is nonparametric (and there are probably other differences in how belief updates are handled), but Pursuit does use a similar sampling procedure. And just as Anderson "independently discovered [the Dirichlet Process], deriving this distribution from first principles" (Sanborn et al., 2010: 1151; citing Neal, 1998) and then developed an approximation algorithm for it, it may be the case that Pursuit is in fact an analogous approximation algorithm for some parametric Bayesian model.
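
To make the selection step concrete, here is a minimal sketch contrasting a Local-MAP-style choice with a sampled choice. It assumes the posterior over categories has already been computed and is handed over as a plain dictionary; the function names are hypothetical and the code is not taken from Sanborn et al.

```python
import random


def local_map_choice(posterior):
    """Pick the mode of the posterior deterministically (Local MAP style)."""
    return max(posterior, key=posterior.get)


def sampled_choice(posterior, rng=random):
    """Draw one category in proportion to its posterior probability."""
    categories = list(posterior)
    weights = [posterior[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]
```

With a posterior such as `{"A": 0.51, "B": 0.49}`, `local_map_choice` always returns "A", while `sampled_choice` returns "B" on roughly half of its draws. That gives a feel for why a mode-picking rule can be sensitive to small changes in probability.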

(cntd.)

(post 2/2)

Now, if we just cared about empirical coverage and didn't care about understanding why we obtain that empirical coverage, this would be totally uninteresting. And it's completely possible that the converse of this does not hold. In fact, I think this is maybe what you're arguing, Charles. But I agree with Mark's second point: modifying preexisting code often increases the chances of compatibility, even if the new code is completely novel in concept. One major reason the connection between Anderson's (1990, 1991) model+algorithm and the Dirichlet Process can be seen so clearly is that he used preexisting code.

For instance, it makes it clear why we might be interested in comparing Local MAP with different particle filters. Both Local MAP and particle filters with different memory sizes do quite well at the sort of categorization tasks Anderson was interested in. But where they differ is in how well they handle order effects. Local MAP as a model of categorization turns out to be more susceptible to order effects (and small changes in probability) than humans actually are, whereas particle filters with small memories make better predictions. It is of course an empirical question whether this is relevant to the word-learning problem, since we would need to know what the sequential structure of word-meaning pairs looks like over the course of learning. But even given we had the data, we would need to know what the relevant class of alternative algorithms was to do the right comparison. And a principled way of getting to this class is to figure out which computational level model the one that we are interested in is approximating. That is, we can call an orange an orange; but let's also understand the orange tree.

Sorry for the delay in replying; I'm at a conference with slow internet and many interesting people, so I'm having trouble keeping up with my electronic responsibilities.

I suspect we're still a long way from understanding the role of the various inputs in language acquisition. Still, I think it's fairly clear that acquisition involves integrating multiple sources of information, if only because as far as we can tell no single source contains enough information on its own.

So I think we need tools for studying such information integration, and explicit probabilistic models (especially Bayesian models) seem to be one good tool for doing this. Maybe we haven't been looking at the right combinations of information sources; and if we've got it wrong I'd like to invite Charles to suggest what they might be. (I'll admit we're doing a certain amount of "looking under the lamp-post"; we're building probabilistic models of phenomena we can get data for).

Yes, ultimately we want descriptions at all levels of the Marr hierarchy. But I think the alternatives proliferate as we go lower (i.e., there are many algorithms that approximately implement a given probabilistic model, and many wetware implementations of each algorithm), and generally the lower we go the less we know. (I think the biggest and most important open question at the implementation level is: how are hierarchical structures (e.g., trees) represented and manipulated in neural circuitry? -- as far as I can tell nobody has the faintest idea). So my preferred research strategy is to work at the higher level of abstraction of probabilistic models; I suspect the chances of us guessing the correct algorithm have got to be close to zero (in fact, I wouldn't be surprised if cognitively realistic algorithms are quite unlike conventional computer algorithms and more like neural network runs, but that's just a hunch).

Hi Mark,

I think we agree more than we disagree.

How does acquisition make use of multiple sources of information? I think we can only study it case by case. (No, I don't have any particularly insightful suggestion, never mind the brain stuff; we all read the same papers.) Empirical work, often by design, has limitations on showing how cues interact. Suppose I want to show cue A works, I design my stimuli to neutralize the contribution from B, C, and D, which precludes the question of interaction. That's why experiments that make cues collide--such as Johnson & Jusczyk, Shukla et al, and John and Lila's work--are so important.

Like you, I think computational modeling can help understand the contributions from multiple sources. But modeling results, like empirical ones, also leave room for uncertainty. Suppose a cue helps the computational model: by adding it, we improve the F-score by 2% (and the computational linguist in me is very happy.) But does it mean that humans use that cue as well? Not necessarily. The role of statistical information, or more precisely transitional probability, is a case in point. It is applicable to segmentation when structural cues are unavailable but does not seem to be applied when structural cues are present. Likewise, John and Lila's work shows no evidence of multiple scene tracking, hence we built Pursuit with both hands tied behind its back. Now these specific conclusions could all be wrong, and we would have to revise the computational models accordingly. But I think this situation is probably typical of language acquisition: there are patterns and regularities in the data, some of which may help a computational model but are altogether passed over by the child learner.

I suppose the challenge for all of us is to get the computational and empirical balance right.