Comments on Faculty of Language: More on Word Acquisition

Hi Mark, I think we agree more than we disagree. ...

2013-08-06T09:56:53.157-07:00

Hi Mark,

I think we agree more than we disagree.

How does acquisition make use of multiple sources of information? I think we can only study it case by case. (No, I don't have any particularly insightful suggestion, never mind the brain stuff; we all read the same papers.) Empirical work, often by design, has limitations on showing how cues interact. Suppose I want to show that cue A works: I design my stimuli to neutralize the contributions from B, C, and D, which precludes the question of interaction. That's why experiments that make cues collide--such as Johnson & Jusczyk, Shukla et al., and John and Lila's work--are so important.

Like you, I think computational modeling can help us understand the contributions from multiple sources. But modeling results, like empirical ones, also leave room for uncertainty. Suppose a cue helps the computational model: by adding it, we improve the F-score by 2% (and the computational linguist in me is very happy). But does it mean that humans use that cue as well? Not necessarily. The role of statistical information, or more precisely transitional probability, is a case in point. It is applicable to segmentation when structural cues are unavailable but does not seem to be applied when structural cues are present. Likewise, John and Lila's work shows no evidence of multiple scene tracking, hence we built Pursuit with both hands tied behind its back. Now these specific conclusions could all be wrong, and we would have to revise the computational models accordingly. But I think this situation is probably typical of language acquisition: there are patterns and regularities in the data, some of which may help a computational model but are altogether passed over by the child learner.
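To make the transitional-probability cue concrete, here is a minimal sketch in Python. The function names, the toy syllable stream, and the strict-local-minimum boundary rule are all assumptions made for illustration; they are not taken from any of the papers mentioned in this thread. The sketch shows only what the statistical cue can deliver on its own, and says nothing about whether infants use it.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate TP(x -> y) = count(x followed by y) / count(x with a successor)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

def segment_at_tp_minima(syllables, tps):
    """Place a word boundary wherever the TP is a strict local minimum."""
    seq = [tps[(x, y)] for x, y in zip(syllables, syllables[1:])]
    words, current = [], [syllables[0]]
    for i, tp in enumerate(seq):
        left = seq[i - 1] if i > 0 else float("inf")
        right = seq[i + 1] if i + 1 < len(seq) else float("inf")
        if tp < left and tp < right:
            # Dip in predictability: close the current word here.
            words.append("".join(current))
            current = []
        current.append(syllables[i + 1])
    words.append("".join(current))
    return words

# Toy stream built from three made-up trisyllabic "words" (hypothetical stimuli).
lexicon = {"A": ["bi", "da", "ku"], "B": ["pa", "do", "ti"], "C": ["go", "la", "bu"]}
stream = [syll for w in "ABCACBAB" for syll in lexicon[w]]
segmented = segment_at_tp_minima(stream, transitional_probabilities(stream))
print(segmented)
```

On this toy stream every within-word transition has probability 1 and every between-word transition is lower, so the dips line up with the word boundaries and the eight words are recovered. Real child-directed speech is nowhere near this clean.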

I suppose the challenge for all of us is to get the computational and empirical balance right.

Sorry for the delay in replying; I'm at a conf...

2013-08-06T02:31:07.773-07:00

Sorry for the delay in replying; I'm at a conference with slow internet and many interesting people, so I'm having trouble keeping up with my electronic responsibilities.

I suspect we're still a long way from understanding the role of the various inputs in language acquisition. Still, I think it's fairly clear that acquisition involves integrating multiple sources of information, if only because as far as we can tell no single source contains enough information on its own.

So I think we need tools for studying such information integration, and explicit probabilistic models (especially Bayesian models) seem to be one good tool for doing this. Maybe we haven't been looking at the right combinations of information sources; and if we've got it wrong I'd like to invite Charles to suggest what they might be. (I'll admit we're doing a certain amount of "looking under the lamp-post"; we're building probabilistic models of phenomena we can get data for).

Yes, ultimately we want descriptions at all levels of the Marr hierarchy. But I think the alternatives proliferate as we go lower (i.e., there are many algorithms that approximately implement a given probabilistic model, and many wetware implementations of each algorithm), and generally the lower we go the less we know. (I think the biggest and most important open question at the implementation level is: how are hierarchical structures (e.g., trees) represented and manipulated in neural circuitry? -- as far as I can tell nobody has the faintest idea). So my preferred research strategy is to work at the higher level of abstraction of probabilistic models; I suspect the chance of us guessing the correct algorithm has got to be close to zero (in fact, I wouldn't be surprised if cognitively realistic algorithms are quite unlike conventional computer algorithms and more like neural network runs, but that's just a hunch).

(post 2/2) Now, if we just cared about empirical ...

2013-08-04T09:39:52.541-07:00

(post 2/2)

Now, if we just cared about empirical coverage and didn't care about understanding why we obtain that empirical coverage, this would be totally uninteresting. And it's completely possible that the converse of this does not hold. In fact, I think this is maybe what you're arguing, Charles. But I agree with Mark's second point: modifying preexisting code often increases the chances of compatibility, even if the new code is completely novel in concept. One major reason the connection between Anderson's (1990, 1991) model+algorithm and the Dirichlet Process can be seen so clearly is that he used preexisting code.

For instance, it makes it clear why we might be interested in comparing Local MAP with different particle filters. Both Local MAP and particle filters with different memory sizes do quite well at the sort of categorization tasks Anderson was interested in. But where they differ is in how well they handle order effects. Local MAP as a model of categorization turns out to be more susceptible to order effects (and small changes in probability) than humans actually are, whereas particle filters with small memories make better predictions. It is of course an empirical question whether this is relevant to the word-learning problem, since we would need to know what the sequential structure of word-meaning pairs looks like over the course of learning. But even given we had the data, we would need to know what the relevant class of alternative algorithms was to do the right comparison. And a principled way of getting to this class is to figure out which computational level model the one that we are interested in is approximating. That is, we can call an orange an orange; but let's also understand the orange tree.

(post 1/2) That's right. Let's call an or...

2013-08-04T09:38:42.891-07:00

(post 1/2)

That's right. Let's call an orange an orange. But we are doing just that when we construct a Bayesian model and then specify different ways to investigate the posterior distribution of one or more of its parameters. We are just doing it as a part of our botanical investigation of the orange tree. As Mark points out, parameter spaces are often too large to carry out such an investigation exhaustively, so we resort to approximating algorithms. The advantage of having the computational level context specified for these algorithms---and this is Mark's point (or at least related)---is that we can then ask how different forms of approximation affect the empirical coverage. That is, we can ask why the "sub-optimal way of making use of information across situations...actually performs better" (Stevens, Yang, Trueswell & Gleitman 2013: 3) in certain contexts. I think the relevant oven analogy might be that we'd like to understand how an oven does its job for those times when we just have a fire and a spit but would like to get as close as possible to the oven's performance. It's true that we need to get the cooking done well. But the knowledge we gain about the relationship between oven-cooking and fire-and-spit-cooking gives us a way of understanding cooking as a whole.

In a similar vein to the Borschinger and Johnson paper that Mark links to, Sanborn, Griffiths & Navarro (2010; http://cocosci.berkeley.edu/tom/papers/rationalapproximations.pdf) have an excellent paper comparing different online approximation algorithms for category learning. One of the algorithms they discuss is Anderson's (1990, 1991), which they call Local MAP, since it samples from a Dirac distribution at the mode of the posterior over categories given the previous datapoints and category samples. This isn't the same as Pursuit, since Pursuit assumes a distribution over a finite number of categories and Local MAP is nonparametric (and there are probably other differences in how belief updates are handled), but Pursuit does use a similar sampling procedure. And just as Anderson "independently discovered [the Dirichlet Process], deriving this distribution from first principles" (Sanborn et al., 2010: 1151; citing Neal, 1998) and then developed an approximation algorithm for it, it may be the case that Pursuit is in fact an analogous approximation algorithm for some parametric Bayesian model.
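For readers who want to see the difference between taking the mode and sampling, here is a rough Python sketch. Everything specific in it is an assumption made for illustration, not a reproduction of Sanborn et al.'s or Anderson's code: items are tuples of binary features, the prior over categories is a Chinese-restaurant-process prior with concentration `alpha`, and each feature gets a symmetric Beta(`beta`, `beta`) prior. `local_map` always commits to the most probable category; `make_sampler` draws one category from the same posterior, which is roughly what a particle filter with a single particle would do.

```python
import random

def cluster_posterior(item, clusters, alpha=1.0, beta=1.0):
    """Unnormalised posterior over category assignments for one item.

    Returns one weight per existing cluster, plus a final weight for a new cluster.
    """
    n = sum(len(c) for c in clusters)
    weights = []
    for members in clusters + [[]]:  # the trailing [] stands for a brand-new cluster
        prior = (len(members) if members else alpha) / (n + alpha)  # CRP prior
        lik = 1.0
        for d, value in enumerate(item):
            matches = sum(1 for m in members if m[d] == value)
            lik *= (matches + beta) / (len(members) + 2 * beta)  # Beta-Bernoulli predictive
        weights.append(prior * lik)
    return weights

def assign_sequentially(items, choose, alpha=1.0):
    """Process items once, in order, committing to one category per item."""
    clusters, labels = [], []
    for item in items:
        k = choose(cluster_posterior(item, clusters, alpha))
        if k == len(clusters):
            clusters.append([])
        clusters[k].append(item)
        labels.append(k)
    return labels

def local_map(weights):
    """Commit to the mode of the posterior (first one wins on ties)."""
    return max(range(len(weights)), key=weights.__getitem__)

def make_sampler(seed=0):
    """Return a chooser that samples an assignment in proportion to its weight."""
    rng = random.Random(seed)
    def sample(weights):
        return rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return sample

items = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (0, 0, 0)]
print(assign_sequentially(items, local_map))
print(assign_sequentially(items, make_sampler(seed=0)))
```

Both choosers see exactly the same posterior at each step; the only thing that changes is how a single assignment is picked from it. Feeding the same items in different orders is a quick way to see how sensitive each one is to presentation order.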

(cntd.)

Thought I would chime in, as this post touches on ...

2013-08-03T15:35:08.985-07:00

Thought I would chime in, as this post touches on my work, as well as topics of my general interest.

I am in complete agreement with Norbert and Mark, that probabilistic methods have proven utility for language acquisition. I took a fair amount of stick for it when I started out with probabilistic parameter setting back in the 90s, and still do, but at least I have company. And I have no problem with Bayesian methods either, having cut my Bayesian teeth with folks like Mike Jordan. The reinforcement learning model I have been using for language acquisition and change can be straightforwardly recast in a Bayesian framework, which is very inclusive.

The only concern I have, and the goal of the Pursuit model (and John and Lila's experimental work), is to get the empirical matters of language learning right. Yes, Bayesian models are very good at incorporating multiple sources of information: the word segmentation work that Mark mentioned is a good example of that. But from the developmental point of view, it's not clear that this is what infants do. The work of Johnson & Jusczyk (2001, JML), and Shukla et al. (2011, PNAS), for instance, shows that when both prosodic and statistical information are available, infants use the former and ignore the latter. These results are quite representative of the acquisition literature. And the results from John, Lila, and colleagues' experiments also show the absence of information tracking over multiple scenes. Of course a Bayesian model can be construed to capture such findings, but why not call an orange an orange, and build a computational model, such as Pursuit, to transparently reflect this? For word segmentation, see the work of Constantine Lignos (http://www.aclweb.org/anthology-new/W/W11/W11-0304.pdf).
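To give a flavor of what "no tracking over multiple scenes" can look like computationally, here is a deliberately simplified sketch of a pursue-one-hypothesis word learner in Python. To be clear, this is not the published Pursuit algorithm: the function name `pursuit_step`, the learning rate `gamma`, the reward and penalty updates, and the uniform random guess are all assumptions chosen to keep the example short. The one property it is meant to show is that on each exposure the learner tests only its current favorite meaning against the scene, rather than updating counts for every word-meaning pairing it has ever seen.

```python
import random

def pursuit_step(scores, word, scene, gamma=0.1, rng=random):
    """One exposure to `word` in a `scene` (a set of candidate meanings).

    scores maps word -> {meaning: association strength in [0, 1]} and is
    updated in place. Returns the meaning the learner settles on this time.
    """
    hyps = scores.setdefault(word, {})
    if not hyps:
        # First exposure: guess a single meaning from the scene.
        guess = rng.choice(sorted(scene))
        hyps[guess] = gamma
        return guess
    best = max(hyps, key=hyps.get)
    if best in scene:
        # Favorite hypothesis confirmed: strengthen it and look no further.
        hyps[best] += gamma * (1.0 - hyps[best])
        return best
    # Favorite hypothesis disconfirmed: weaken it, then try one new guess.
    hyps[best] *= (1.0 - gamma)
    guess = rng.choice(sorted(scene))
    old = hyps.get(guess, 0.0)
    hyps[guess] = old + gamma * (1.0 - old)
    return guess

scores = {}
for scene in [{"DOG"}, {"DOG", "CAT"}, {"CAT"}]:
    pursuit_step(scores, "dax", scene)
print(scores["dax"])
```

Note that in the second scene CAT is present but never gets recorded, because the learner's favorite (DOG) was confirmed and nothing else was examined. A cross-situational learner that tracked all co-occurrences would have counted it.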

I have heard of, and used, the complexity argument before: Method X is too computationally intensive to be psychologically plausible. But I haven't used this recently, because it is too easily defeated: "we don't know how the brain works", and that's a fair point.

For practical purposes--we want to get the results back before dinner!--approximation methods are probably the way to go, even though there are intractability results on approximate methods as well. But sometimes the online approximation method alone (such as Constantine's work) may be sufficient for the job, sans the more complex model it tries to approximate. It'd be like building an oven to light a candle, but the oven needs to be lit by a match.

Thx. As you might guess, this level of detail is b...

2013-08-03T14:01:05.498-07:00

Thx. As you might guess, this level of detail is beyond me. But I encourage the cognoscenti to jump in here. It would be nice to get a heads up from the pros on what to look at and why.

I'd like to encourage you to look again at Bay...

2013-08-03T13:41:30.439-07:00

I'd like to encourage you to look again at Bayesian approaches and probabilistic models more generally, such as Mike Frank's and the ones we've developed in work on "Adaptor Grammars". Yes, some of these models have complicated mathematical derivations, and this is often used as an excuse to ignore them as "psychologically unrealistic". But I claim this is a level confusion!

A probabilistic model is what Marr called a "computational model"; it specifies the different kinds of information involved and how they interact. For models of potentially unbounded things such as the lexicon the maths here is still quite new and unfamiliar to many. I've heard the apparent complexity of this maths used as an argument against Bayesian models, but this is as misguided as complaining that there's no way the planets could be doing anything as complicated as solving differential equations when they move around the sun.

One interesting property of probabilistic models is that there are a variety of relatively generic algorithms for inference ("learning"). We know there are trade-offs here: the computationally-intensive models you complain about are so intensive precisely because they are searching for optimal solutions.

But there are a variety of other Bayesian algorithms that trade optimality for other properties, such as "on-line" learning (i.e., exactly one pass through the data, see e.g. http://aclweb.org/anthology-new/U/U11/U11-1004.pdf).

Interestingly, some of these algorithms have the kind of properties you laud in your article (e.g., they maintain only a single analysis of each data item, rather than tracking all possible combinations), and in fact many of the algorithms you speak of approvingly can be viewed as approximations to some of these Bayesian algorithms.

A more theoretical approach such as a Bayesian one provides two big advantages over more ad hoc approaches.

First, it helps us understand why these algorithms have the structure they do (e.g., why do we smooth by adding rather than multiplying?). Understanding the approximations involved in the algorithm can help us understand what has happened when our models go wrong (and all our current models are wrong).
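As one small illustration of the smoothing point (my own toy example, with a made-up function name and data): if we put a symmetric Dirichlet prior with parameter alpha over a K-word vocabulary, the posterior predictive probability of word w after seeing counts n_w out of N tokens is (n_w + alpha) / (N + K * alpha). The familiar "add alpha to every count" recipe is that formula, so the adding falls out of the model rather than being a design choice.

```python
from collections import Counter

def dirichlet_predictive(observations, vocabulary, alpha=1.0):
    """Posterior predictive under a symmetric Dirichlet(alpha) prior.

    Equivalent to add-alpha smoothing of the relative frequencies.
    """
    counts = Counter(observations)
    total = len(observations) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}

probs = dirichlet_predictive(["a", "a", "b"], ["a", "b", "c"], alpha=1.0)
print(probs)
```

The unseen word "c" gets nonzero probability purely from the prior pseudo-count, and the estimates still sum to one.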

Second, and more importantly, it's usually easier to see how to extend and combine probabilistic models than algorithms. For example, associating words with topics as is done here is just one of many challenges in word learning; merely identifying the words in an utterance is a challenge. While there are algorithms for word segmentation, it's not at all obvious how to combine such an algorithm with the algorithm described in this post. However, at the level of probabilistic models this is not so challenging: we've shown how such models can be integrated, and perhaps more interestingly, we find synergies in their combination (i.e., word segmentation improves, as does learning word-topic mapping!) http://books.nips.cc/papers/files/nips23/NIPS2010_0530.pdf .

I could go on explaining how the algorithm for this model only keeps a small number of analyses for each possible word, but this post is long enough already ...