Faculty of Language: A Quick Thanksgiving Reply to Alex, Avery, and Noah

Wednesday, November 21, 2012

I am still playing impresario to Bob Berwick's arias. Enjoy! What follows is all Robert C. Berwick:

Excellent questions all; each deserves a reply in itself. But for now, we’ll have to content ourselves with this Thanksgiving appetizer, with more to come. The original post made just two simple points: (1) with respect to Gold, virtually everyone who’s done serious work in the field immediately moved to a stochastic setting, more than 40 years ago, in the best case coupling it to a full-fledged linguistic theory (e.g., Wexler and Degree-2 learnability); and (2) simply adding probabilities to get PCFGs doesn’t get us out of the human language learning hot water. So far as I can make out, none of the replies to date has really blunted the force of these two points. If anything, the much heavier-duty (and excellent, and far subtler) statistical armamentarium that Shay Cohen and Noah Smith bring to bear on the PCFG problem actually reinforces the second message. Evidently, one has to use sophisticated estimation techniques to get sample complexity bounds, and even then, one winds up with an unsupervised learning method that is computationally intractable, solvable only by approximation (a point I’ll have to take up later).

Now, I personally find such results enormously valuable, but as Cohen and Smith (CS) themselves say in their introduction, they’re considering probabilistic grammars that are “used in diverse NLP applications…” “ranging from syntactic and morphological processing to applications like information extraction, question answering, and machine translation.” Here, I suspect, is where my point of view probably diverges from what Alex and Noah subscribe to (though they ought to speak for themselves, of course): what counts as a result? I can dine on Cohen and Smith’s fabulous statistical meal, but then I walk away hungry: What does it tell us about human language and human language acquisition that we did not already know? Does it tell us why, e.g., in a sentence like “He said Ted criticized Morris,” he must be used deictically, and can’t be the same person as Morris? And, further, why this constraint just happens to mirror the one in sentences such as “Who did he say Ted criticized,” where again he must be deictic, and can’t be who; and, further, why this just happens to mirror the one in sentences such as “He said Ted criticized everyone,” where again he must be deictic, and can’t mean everyone; and, then finally, and most crucially of all, why these same constraints appear to hold, not only for English but for every language we’ve looked at, and, further, how it is that children come to know all this before age 3? (Examples from Stephen Crain by way of Chomsky.) Now that, as they say, is a consummation devoutly to be wished. And yet that’s what linguistic science currently can deliver. Now, if one's math-based account could do the same – explain why this pattern must hold, ineluctably – well, then that would be a Happy Meal deserving of the name. But this, for me, is what linguistic science is all about. In short, I hunger for explanations in the usual scientific sense, rather than theorems about formal systems: explanations like the ones we have about other biological systems, or the natural world generally. Again, remember, this is what Ken Wexler's Degree-2 theory bought us – to ensure feasible learnability, one had to impose locality constraints on grammars, constraints that are empirically attested. In this regard it seems my taste runs along the lines of Avery Andrews’ comment regarding the disability placard to be awarded to PCFGs. (Though as I’ll write in a later post, in purchasing Bayes as a “major upgrade over Chomsky,” it turns out, perhaps surprisingly, that he’s bought the older, original model.) Show me a better explanation and I’ll follow you anywhere.

As for offering constructive options, well, but of course! The original post mentioned two: the first, Ken Wexler's Degree-2 learnability demonstration with an EST-style TG; and the second, Mark Steedman's more recent combinatory categorial grammar approach (which, as I conjectured just the other day with Mark, probably has a degree-1 learnability proof). (And in fact both of these probably have nice Vapnik-Chervonenkis learnability results hiding in there somewhere – something I’m confident CS could make quick work of, given their obvious talents.) The EST version badly wants updating, so there’s plenty of work to do. Time to be fruitful and multiply.

16 comments:

Alex Clark, November 25, 2012 at 12:42 PM
It is entirely reasonable to criticise the positive PCFG learning results on the grounds that they fail to address computational complexity considerations. For me, computation is the central problem -- not the lack of negative evidence issue which is a sideshow in the light of a probabilistic view of learning -- we agree on that.

But does either the Wexler and Culicover work or the Kwiatkowski et al. (incl. Steedman) work have a satisfactory answer to this problem?

Moreover, the W & C work is based on very unrealistic inputs -- if I recall correctly (and I may not), they assume a deep structure tree as input. Is that a reasonable assumption in your view? Is the assumption of learning from sound/meaning pairs (as in the Edinburgh work) reasonable as a model of language acquisition?
Norbert, November 26, 2012 at 7:41 AM
RCB replies:

It’s good to know that for some, the lack of negative evidence “becomes a sideshow” resolved by probabilistic learning. Unhappily, this doesn't help me sleep any better at night, because I don’t subscribe to this view. None of the responses so far has offered any explanation for how it is that children acquire knowledge that’s never attested in adult input, or knowledge that runs counter to adult input – so that probability matching simply fails (including all the much-touted Bayesian models known to me). And by this I mean an explanation of, e.g., how children by age 3 come to know that pronoun anaphora, wh-crossover examples, and quantifier-pronoun anaphora all pattern alike. There are many more examples; Stephen Crain's latest book is chock full of them. This is the bread-and-butter of modern linguistic theory, of course – as Crain shows, linguistic theory does explain all this. I’ll touch on this in a later post.

To recall a bit more history, in 1970, Ken Wexler and Henry Hamburger first tried for a learnability proof working only from surface strings – and showed this was not possible for EST-style transformational grammars. (It’s in the chapter immediately following Ken’s measure-1 learnability account.) Good scientist that he is, Ken then resorted to (base, surface string) pairs – and wound up writing nearly 100 pages in his 1980 book justifying this strong medicine, which he, along with everyone else, knew was too strong.
But to stop here misses the point. As Ken stressed, even when given this enriched data, one still requires substantive, empirically verified constraints on the space of possible grammars to establish learnability – locality constraints on movement, as mentioned in previous posts – and several others besides. That’s what we already knew 40 years ago. Thirty years ago (shameless plug follows), I tried to improve on this by reducing base structure to just a rough thematic argument structure – again, computational feasibility was the goal. Again, strong locality constraints were required. Can we now do better? I hope so. But here it seems to me that Ken’s strategy still applies: one starts with a known linguistic theory – which is to say, one that we already know has attained some degree of explanatory adequacy – and builds a learning infrastructure around that.
Alex Clark, November 26, 2012 at 8:50 AM
I was just talking about the lack of negative evidence -- not things like probability matching etc. Why doesn't the switch to probabilistic learning deal with the lack of negative evidence? If it doesn't then how is it different from Gold learning?

Wexler's strategy (I don't feel I can say Ken's) is problematic -- because the theory may be wrong. In which case one's learning infrastructure is irrelevant. Now EST has been completely abandoned so I don't see why looking at the learning of EST is a good idea.
I like Stabler's MGs -- which seem not to have the problems of EST. I don't know what 'attained some degree of explanatory adequacy' means -- I don't think MGs have a learning theory yet -- but it seems better to target a live theory rather than resuscitate an inanimate corpse. Which would only give you a grammatical zombie ....

The natural question, if you are studying learnability of MGs, is whether you should consider the inputs to be sound/meaning pairs or derivation trees or just strings -- I'd be interested in your view on that.
rcb, November 26, 2012 at 6:27 PM
Avery, that's a good source. But I also think one might bear in mind the original - and that is Carl de Marcken's MIT thesis, which learned morphological forms from phonemic transcriptions. (Carl assumed there was some front-end 'speech recognizer' that could map a raw speech signal stream to phonemes. He had to start somewhere.) This was based on minimum description length (de Marcken, 1996). Goldsmith's work flows from this, as John will tell you. In fact, Carl did much, much more than this (but my prior is biased: I supervised his thesis). He even had a component that learned the mappings to both syntax and semantics. Most of that did not surface in his thesis, so I'll post links both to his thesis and to those missing bits later. Carl's work will join the discussion again when there's time to post about Bayesian learning methods and minimum description length - promises, promises, I know. But I'll make good on this promissory note soon.